Okay, first I wanted to thank the committee for inviting me here today. When I was writing the speech I had a lot of fun, because I could go back to the old days. I'll cover a lot of history, and I have examples from the present as well, but I think we can learn a lot from past ventures, mistakes and so on. There is also a very short cycle in research, something like eighteen to twenty years, after which everybody forgets what was done before, so it's always nice to review. I'd like to give some acknowledgments before I start: many people were involved in the work that I will describe, but I owe one special acknowledgment to my colleague, whose expertise, knowledge and imagination led to a lot of this work. So with all that, I will proceed with my talk.
The topic is ASRU, and we are now in the third day of ASRU. For two days we have had lots of talks about ASR, but the U has been missing, so I will try to somehow fill that gap. ASRU is a branch of a larger family of applications which is usually referred to as natural language processing.
Natural language processing deals with a variety of inputs. To most people, keyed or typed input would seem to be the simplest: it does not require transcription, and for most languages you have things like word boundaries and punctuation. Although when you're typing you may not add the punctuation, when you hit return, or something like that, that's the end of your request. But it has certain problems, like homographs, which are problems that occur when you're trying to get a meaning representation from the input. I wrote here "hardcopy", but what I mean is handwritten input. It shares a lot of the difficulties of typed input, but it has the added difficulty that it requires transcription. It's not as bad as when you're dealing with true hardcopy, because online you have access to the strokes, and consequently you probably get a lot fewer errors, but it is still challenging.
Speech, in a sense, shares the same properties that handwritten input has, but on top of that, with speech we also have the problem of deciding where things like word boundaries are. Speech does have one feature, though, that is not common to the first two, which is prosody, and in these particular systems prosody is, in my opinion, extremely important. When you are transcribing speech just for transcription's sake it really doesn't matter, but when you are trying to infer the intended meaning, prosody may or may not play a role.
Let me give an example. Take a simple question like "Is this a book?" Depending on whether you stress the word "this" or the word "book", it still remains somewhat ambiguous, but if you stress "this", the response might be "No, that is a book", whereas if you stress the word "book", the response would be "No, that is a magazine", or whatever. That ambiguity is not resolvable from text alone, especially in a dialogue situation.
Moving on: I have replaced the meaning representation with applications, which will result in either actions or verbal responses. I have taken the liberty of defining three separate application classes. These are for my convenience in this talk; they are by no means meant to be a rule, and there is going to be some overlap between some of these applications. But I will discuss these different applications as I go through the talk, and we'll see what we can see. From now on I will take ASRU to mean that we are going to have speech input and data output, and these applications will have to rely on a dialogue system. On the next slide I have a chart of an example dialogue system, which I can use to explain.
Basically, since I used to work for the telephone company, the input here is telephone, but it could be simply a microphone input. The next stage is the transcription task, speech to text, and customarily a large-vocabulary continuous speech recognizer would be used. In the following stage we try to extract meaning, and the meaning may be application-driven or it may be totally unrestricted. The second is not within reach today, because it requires full semantic interpretation, but within an application environment we can talk about application-driven semantic rules, and when I say rules here I do not necessarily mean manually constructed rules. Finally we get to the dialogue manager, which has to make a decision: if a response is required, or an error is detected, a query goes back to the user; and if an action is necessary, then an action will be invoked.
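To make the chart concrete, here is a minimal sketch of those stages; every name in it — the stage functions, the canned outputs, the decision strings — is my own invention for illustration, not part of any deployed system.

```python
# Illustrative skeleton of the dialogue system on the slide:
# input -> transcription -> meaning extraction -> dialogue manager.

def transcribe(audio: bytes) -> str:
    """Stand-in for a large-vocabulary continuous speech recognizer."""
    return "i would like a used car loan"          # canned output for the sketch

def extract_meaning(text: str) -> dict:
    """Application-driven semantic rules (hand-built or learned)."""
    return {"intent": "loan", "type": "used_car"} if "loan" in text else {}

def dialogue_manager(meaning: dict) -> str:
    """Decide whether to act, respond, or query the user on error."""
    if not meaning:
        return "query_user"        # error detected: ask a clarifying question
    if meaning.get("intent"):
        return "invoke_action"     # hand the request to the application
    return "respond"

print(dialogue_manager(extract_meaning(transcribe(b"..."))))  # invoke_action
```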
Now I'd like to spend a couple of minutes on the language analyzer portion of this, and again I will make a few suggestions, but by no means should these be thought of as all-encompassing. The simplest method is to use keyword or phrase spotting. This is a mature technology which is very robust to ASR errors. It is manually configured, but it is easy to change an application by simply adding content to it, although it does require an expert to design.
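As a toy illustration of phrase spotting, here is a sketch; the phrase table is invented and far smaller than anything deployed.

```python
# Toy phrase-spotting router: scan the utterance for known phrases and
# ignore everything else, which is what makes the method robust to ASR errors.

ROUTES = {
    "collect": "collect_call",
    "calling card": "card_call",
    "operator": "human_operator",
}

def spot(transcript: str) -> str | None:
    """Return the destination for the first known phrase in the utterance."""
    text = transcript.lower()
    for phrase, destination in ROUTES.items():
        if phrase in text:
            return destination
    return None   # nothing spotted: re-prompt or hand off to a human

print(spot("uh I want to make a collect call please"))   # collect_call
```

Changing the application is then just a matter of editing the table, which is the ease-of-change point above.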
Next is what most people refer to as statistical methods. I don't like that name, because "statistical methods" also refers to other things, like parsing, so I use the term machine learning from parallel corpora. Here you have speech on one side and the resulting actions on the other side, and you can map one to the other, pretty much the way speech translation systems do. This is of course fully automatic, but you do need to obtain data; in many applications that data would be very easy to acquire. The main drawback is that if you want to change or add something to your application, you need to do additional training.
Syntactic analysis would be very good for some applications. It is not as robust as some of the other technologies, but if it can be trained on the specific genre or topic of the application, then the analysis can become very robust. Again, it is quite easy to change or extend applications, and it is also helpful in conjunction with ASR for error detection and localization. Shallow semantics just contributes additional information when necessary, for the arguments themselves — that is predicate-argument analysis — and it is very important for queries, which I'll discuss later in my talk. And finally there is deep semantics, which I will not discuss, because it really is not ready for prime time.
So I will start by discussing call center applications. This is something that we worked on in the late nineties, when Lucent was very involved in small business switching units. The business is huge, so it's commercially extremely viable — of course it is much larger today, but an estimated eighty billion dollars a year was the figure quoted in the nineties. Given that, an application does not even have to replace a human operator: just cutting a human operator's time could result in tremendous savings.
Now let me turn to probably the first successfully deployed ASRU application, which was the AT&T operator system. This was a simple application with natural language input, but of course what it used was not natural language analysis; it was word or phrase spotting for only five phrases — right offhand I don't remember all five. But it was deployed by AT&T, which at that time was the largest corporation in the world, with thousands of operators, so just cutting a few seconds off each operator call saved the company approximately three hundred million dollars a year.
Going back, I have a list here of applications for a call center. Call routing and form filling I will discuss in greater detail. Unrestricted interactions — which would be something like accessing by voice the complete website of a store or business — are something that will come up in the later discussion, since in such applications you are not limited by the ASR capabilities but by the NLP capabilities. So I will not discuss them much except in my conclusion.
So let's start with a typical call router for a call center; we actually implemented the one shown here around the turn of the century. The opening prompt was a very open question, and there was a routing matrix with confidence scoring, as well as a destination threshold. If everything was met, the call was routed. If either of those failed, the system had the option of re-asking the question, sending the call to an operator — probably after a retrial — or requesting the user to rephrase the request. But there was one other branch to this dialogue system, which was taken when we encountered multiple destinations; multiple destinations I will explain on the next slide.
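A hedged sketch of that decision logic, with invented scores and thresholds, might look like this:

```python
# Toy routing decision: route only when the best destination is confident
# AND clearly separated from the runner-up; otherwise fall back.

def decide(scores: dict[str, float], min_conf: float = 0.6,
           margin: float = 0.2, attempts: int = 0) -> str:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (dest, best), (_, runner_up) = ranked[0], ranked[1]
    if best >= min_conf and best - runner_up >= margin:
        return f"route:{dest}"
    if best >= min_conf:
        return "disambiguation_dialogue"   # several destinations look plausible
    if attempts == 0:
        return "reprompt"                  # ask the user to rephrase once
    return "operator"                      # escalate after a failed retry

print(decide({"loans": 0.85, "insurance": 0.30}))   # route:loans
print(decide({"loans": 0.45, "insurance": 0.40}))   # reprompt
```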
This was evaluated with a bank and an insurance company, with forty routing destinations, and at that time, despite the fact that the ASR was not at the level it is today, we got ninety-six percent routing accuracy, which means the false alarm rate was only about four percent. Eight percent of the calls went to an operator, but we did not keep statistics on how many of those were legitimate routes, because the request was totally out of domain, and how many were actual misses.
The disambiguation dialogue serves two purposes: first, the customer may not know the exact structure of the routing tree, and second, it lets us combine certain classes so that we get better separation and more successful routing. So if the user says "I'm looking for a used car loan", there is only one branch that satisfies the criterion. But the user may say just "loan", or "truck loan", where "truck" is not one of the words in the vocabulary; then the machine would get them into "loan" and start a dialogue.
Here is an example with the loan task. One of the user options is the so-called home or personal loan. Once the user has said "loan", we go to that branch, and because there are only two options, the system asks "Is this for an existing loan?"; when the user signals that it is an existing one, the call is routed successfully.
The underlying technology for this was word or phrase spotting, which was easy to configure. It did require linguistic expertise, but it was extremely accurate, especially when the routing destinations were well defined, and it was easy to adapt to a new application. The second alternative for this would again be to train from parallel corpora, which in my opinion is a slight overkill, although analysis of the data would provide the lexicon, which could then be used for keyword or phrase spotting.
During call handling there is often the need for verification, or authentication, of the user.
This is sort of an aside, but I wanted to show you a really easy-to-enroll system for authentication, because customers will customarily have called in several times, so you can get their voice. We start with a caller asking for an account by number, login or whatever, and if the account does not exist the call goes to an agent. If the account does exist, then we look at the user models, and if the caller is authenticated by voice, the system can choose — not necessarily, but it may choose — to add that information to the customer data for adaptation. If, however, the voice authentication failed, we go to another form of authentication, which would be something like customer challenge questions; if they are answered correctly, the user is again authenticated, and their speech is sent to the database for training, so that the next time they would be automatically verified. If that failed as well, we go to a human operator. So this is an extremely easy-to-implement, easy-to-use paradigm for authentication.
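In code, the flow on that slide might look something like the following sketch; the in-memory store and the one-line matching rule are toy stand-ins, not a real speaker-verification API.

```python
# Toy version of the enrolment/verification loop described above.

voice_models: dict[str, list[bytes]] = {"12345": []}   # accounts we know about

def voice_matches(account: str, speech: bytes) -> bool:
    return bool(voice_models[account])     # toy rule: enrolled means verified

def handle_call(account: str, speech: bytes, answers_ok: bool) -> str:
    if account not in voice_models:
        return "agent"                          # unknown account: human agent
    if voice_matches(account, speech):
        voice_models[account].append(speech)    # optional model adaptation
        return "authenticated"
    if answers_ok:                              # challenge questions passed
        voice_models[account].append(speech)    # enrol for next time
        return "authenticated"
    return "operator"                           # second failure: human operator

print(handle_call("12345", b"hi", answers_ok=True))   # authenticated (now enrolled)
print(handle_call("12345", b"hi", answers_ok=False))  # authenticated (by voice)
```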
The next application I call the form-filling application. It covers many types of applications, such as travel reservations, appointments, and many simple transactions, which could be bank transactions or store transactions. In these types of applications there are many fields that have to be filled in order to be able to execute the request.
I have taken the liberty of writing out the script of what I generally go through when I want to find out whether my train is running on time, and this is more or less the state of the art in deployed form-filling applications today. As you can see, it's a very strenuous process. The present technology is one where the computer initiates the dialogue. It is well designed for confirmation and does a fairly good job of error detection, but it's not really an example of ASRU, and not really the state of the art of the technology; it's just what is available out there today.
By contrast — and this has nothing to do with me, although it is DARPA — DARPA ran a program on exactly this many years ago, and theirs was really a state-of-the-art system, using mixed-initiative dialogue, able to fill many of the entries in the form from a single utterance, with good error detection and clarification dialogue.
The application that I showed before would be much better if it looked like this, where you can say something like "What time does the train from New York arrive?", and since you didn't say what the date was, the machine simply knows that it is missing from the form and asks you for that feature.
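Here is a minimal sketch of that slot-filling behavior; the slot list and the toy pattern extractor are mine, purely for illustration.

```python
# Toy mixed-initiative form filling: one utterance can fill several slots,
# and the system asks only for whatever is still missing.

import re

SLOTS = ["origin", "destination", "date"]

def extract(utterance: str) -> dict:
    """Toy extractor for 'from X to Y ... [on DATE]' patterns."""
    slots = {}
    if m := re.search(r"from (.+?) to (.+?)(?=\s+(?:on|arrive)\b|$)", utterance):
        slots["origin"], slots["destination"] = m.group(1), m.group(2)
    if m := re.search(r"\bon (\w+)$", utterance):
        slots["date"] = m.group(1)
    return slots

form = extract("what time does the train from new york to boston arrive")
missing = [s for s in SLOTS if s not in form]   # -> ['date']
if missing:
    print(f"For what {missing[0]}?")   # the system asks only for the date
```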
Again, if we look at the underlying technology, my opinion is that this is best served by syntactic analysis. Shallow semantics is a possibility, but not necessary for most of these applications. So it would be easy to implement, as long as you have a fairly robust analysis of the syntax, and it may also help the ASR. The machine learning paradigm would be difficult to generalize to other applications, but could be usable given enough training data. Keyword or phrase spotting, however, would not be a satisfactory solution, because you'd have too many keywords in each phrase uttered.
Okay, I have the signal, so I'm going to change pace now and go to speech translation applications. Before continuing, I'd like to play a very short segment of videotape. I know that you will recognize at least one culprit in the video, and many of you will probably recognize the setting. "Hi, I'd like to buy pesetas." [The system responds in Spanish and asks for identification.] "Here's my passport." "What is the exchange rate between US dollars and pesetas?"
So this, to my knowledge, was the first bilingual dialogue, or speech-to-speech translation, paradigm. That has been disputed, and apparently CMU claims they did this first; I'm not sure whether that's right, because when this was implemented there was no ASR system that ran in real time on a computer, and Bell Labs of course built special hardware, consisting of twelve DSP modules running in parallel, to be able to do the ASR in more or less real time — perhaps slightly slower. But it was an accomplishment in that sense.
The system consisted of a speech recognizer with a specific grammar for the application, a bilingual parser, a bilingual translator — not really a translator in today's sense, but it was a bilingual translator — and two text-to-speech modules, which produced the speech output. It is probably better to describe the system by its size: I can't say exactly what was involved, but I think it was around four hundred words, the keywords in each of the languages, and of course the translation was quite straightforward, since you knew what the words were. Today's bilingual human-machine dialogue is quite different, and the underlying technology has been replaced by generalized — today, statistical — machine translation.
Okay, present applications are quite good: first for single-turn restricted-domain applications, and, while not as robust, still extremely good for unrestricted single-turn dialogue. But single-turn translation is not accurate enough for multi-turn dialogues. I think we're all familiar — or maybe not — with the telephone game, where you say something to your neighbor and it continues along the line until it has no resemblance to what the message was originally. And of course that is what will happen here, since the two conversants do not understand each other's language. So there is a clear need for clarification and disambiguation, which would result in a human-machine dialogue before the translation happens. And there is also a need to understand context, coreference and so on, in order to be able to succeed with a multi-turn freeform conversation.
Moving on to command and control, I will describe three applications: personal agents, computer user interface by voice, and robot control.
This is another project — the last project that we did before our group at Bell Labs closed its doors — which was a personal agent. In those days, and this was back in 2001, the world was quite different: I don't think we foresaw the prevalence of smartphones, and in those days mobile phones were used strictly for voice, so this type of application was extremely necessary. It consisted of a variety of branches; we did not get to do too many of them, but we did manage to implement the functions for remote reading and writing of email. So it was partially implemented at Bell Labs in 2001, with full dialogue capabilities.
was it will dialogue capabilities
the advantage for this system was that it could
quality and
a lexicon depending on the task
so for example if you're given a day
that you're interested in an email you could collect all the nine
and subjects for that they so the one who pro
so that they to see
and have an email remotely right to down
there was a error detection
and clarification dialogue
but in addition
there was a test task dependent
what men
so this system did not need any startup training
there were quite a few other systems of this nature at that time and they
also for the mice because they required by to have our training
and very few customers for willing to spend time
this is not important
less than that i will touch and lighter in my conclusion
Let's talk about the computer voice interface. It was originally conceived as a telephony interface, because if you wanted to probe your computer remotely there was no other way to do it; as I said, that need has disappeared with the emergence of smartphones. But it does contribute to ease of use, and it especially caters to the handicapped — the mouse, in this case, is not needed. I mentioned it under multimodal use: of course one could also use gestures, and eye tracking if your computer is equipped for it, and that does enhance the interaction. So, for example, if you're working in an Excel sheet, instead of having to write the formulas you could simply verbalize without the mouse by saying "average column three", or with the mouse simply point to the column — or point with your finger — and say "average this column".
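As a toy illustration of that kind of multimodal resolution — combining a spoken command with a pointing event — here is a sketch; the command and event formats are invented.

```python
# Resolve a deictic voice command against the latest pointing event.

def resolve(command: str, pointed_at: str | None) -> str:
    """Replace 'this column' with whatever the user last pointed at."""
    if pointed_at and "this column" in command:
        return command.replace("this column", pointed_at)
    return command

print(resolve("average this column", pointed_at="column C"))
# -> "average column C"
```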
And finally, robotic command and control. Nelson showed us a toy earlier. A few weeks ago I was visiting my granddaughter, and she actually has this toy. It is not, I think, a voice-responsive toy; it's actually trained by the child, and it does all sorts of things like sit and come, and as you can see, my granddaughter loves it. I'm sure many of you have seen the robot Wall-E, which was a garbage-collecting thing — a robotic device, but not voice-controlled. The next one is a device used by the military to explore spaces and defuse bombs. Generally it is not operated by voice but activated by joystick; but if the soldiers do not have time to wait for it to explore the space before they enter, voice control would certainly help.
And finally, this is a program run by DARPA with the strange name BigDog. I don't know why it's called BigDog — "big mule" would probably be better, because it's meant to carry a lot of provisions so that the soldiers are not loaded down with the weight. This particular device can certainly use voice control, because it accompanies the soldier, and the soldier needs to remain hands-free and eyes-free to be able to operate.
So voice control of robots is fun and entertaining, and extremely useful for both commercial and military purposes. BigDog, as I showed before, is a companion to a soldier, and it's the perfect setup for multimodal communication, because when you have to stay hands-free and eyes-free, it certainly is more natural to tell BigDog "go there" and point, or have it follow your gaze. The other thing that I have added here is that the reverse could be useful too: the robot itself could use gesture as a direction finder.
So now I would like to address what I think is necessary for the future. Obviously, for ASR we still have a problem with robustness to noise and channel conditions; I believe that is being worked on. But there is an equally nagging problem in language modeling which prevents the technology from being robust across topics in general: very often we train with lots of data for a specific domain, and when we switch to a different genre the accuracy falls very drastically. So I do believe that we need to spend a lot of effort researching language models.
I had the luxury, a few years ago, of having an experiment done, because I was curious as to how computer phonetic transcription relates to human phonetic transcription. Most people believe that humans are extremely adept at phonetic transcription, and I believe that is because many of the experiments in phonetic transcription have been done in artificial settings, where the results come out much higher than they should be.
So we ran an experiment where we asked human transcribers to transcribe speech naturally, except that they had no lexical, semantic or even phonotactic information. To do that, you choose two languages with an extremely similar phoneme set, have one set of native speakers speak one language, and have another set of native speakers transcribe it in their own language, as best they can.
The experiment was actually carried out with an additional language as well, but I will show the results for the first two languages, which were Japanese and Italian; they have a tremendous overlap in phonemes. As you can see here, the ASR had a 34.9 percent phone error rate, the average human had 29.9, and the best human had 17.2, which far exceeded the machine. And humans have no trouble understanding speech even at 37.5 percent phone error rate.
The experiment was also done using Spanish and Italian, and of course there is quite a bit of phonotactic overlap and some lexical overlap between those, so the results for Spanish-Italian were much higher. But the point is that when you are deprived of any kind of language model or phonotactic model, the machines are doing almost as well as the average human; there is really room here for about fifty percent relative improvement, which is roughly the gap to the best human. I might add that the recognizer used here was not the latest neural net recognizer, and we're beginning to see those give fifteen percent relative improvement. So maybe soon the machines will match the human ability to transcribe.
Moving on: people always talk about prosodic analysis in ASR, but so far there has been very little research. It's not important for transcription, or for one-way translation, but it's extremely important for dialogue, because intent does drive the dialogue.
Those of you who have known me in the past are probably wondering why I haven't said much about text-to-speech so far. That technology has really taken a turn — in some respects for the better, but in many respects for the worse. It sounds a lot more natural than it did in the nineties, because of the HMM models and other large-vocabulary, large-data synthesis, but prosody has pretty much disappeared from text-to-speech. Again, that may not be important if you're expecting a one-sentence response, but if you're trying to listen to a full paragraph, I guarantee that you will not have much comprehension. The text-to-speech community still does quality evaluations, but as far as I know they don't do much comprehension evaluation — maybe I haven't kept up with the community, so I'm not sure — but I think it would be good to do an experiment which we actually did years ago: present a very large, complex paragraph via text-to-speech, then ask college-exam-like multiple-choice questions and see how much is retained.
For these applications, error detection and error localization are extremely important (my computer had problems here), and we need dialogue for error recovery. Also, dialogue for help menus is extremely important to facilitate applications. And finally, joint optimization between the ASR and the application quite often reduces the error rate of the application, even if it may increase the word error rate of the ASR. We have seen that repeatedly, in various programs where we fed either transcriptions of speech or transcriptions of handwriting into translation: joint optimization actually helps.
What can we do as a community about the problems that are preventing certain applications from becoming deployable? There has to be a lot more work in Q&A and in information retrieval. There has been work on that, but I don't believe that the accuracy is such that it would satisfy the kind of customers that would call in. It does have a lot of value in more analysis-type work, but we have to have a very low false alarm rate and a lot more detection before we can actually do Q&A.
And I know that we all use the giant of information retrieval, and it does have a hundred percent recall, but it also has close to zero percent precision, and one should not expect customers to accept responses with zero percent precision. We actually faced this in one aspect of GALE, what we called distillation, which required very different responses: targeted answers, where it was important who did what to whom. And we had one such example — a query where it mattered who went down and who came up — where the first fifty responses by Google were all the reverse, while the GALE distillation was actually able to pick the right one. Still, I think that there's a lot more work to be done.
Again, there should be a lot more work done in unrestricted bilingual dialogue. One of the things that prevents this technology from going forward is that there is a need for platforms that learn. I haven't done this as an experiment, but imagine a dialogue system, whether with your robot or with your desktop, where whenever it encounters an OOV you can explain that word and have it retain it, and whenever it encounters a construction that it does not understand, it comes back with a clarification dialogue and you can explain that too. Eventually such systems would become smarter and better. Systems should also eventually be configured to be able to do planning and inference. And finally, just before I left, I started a program in grounded language acquisition, aiming at full AI semantics. It did not get very far, but I do believe that there is room to do a lot of research in this area.
With my final slide I'd like to talk a little about the choice of applications. When designing an application, you have to think about the customers' trust in it. Applications with too many false alarms — that is, a router with too many bad routes — engender lack of trust by the customer. The number of misses is not as crucial, though it is application dependent, because you can always have a soft fallback when you miss an action. It is also important to reduce the cost of enrollment and the cost of learning a specific application, which is usually done by the machine itself detecting errors and correcting them. It is important to design compelling applications: some applications may be easy to implement, but unless they fill an urgent need, they will most likely fail. It is also always wise to ensure that your application is compelling relative to alternative ways of accomplishing the task; if there is an easier alternative, again, the application will disappear.
finally i'd like to and this on a real positive know which is good news
and you're all for a
bill gates this and that speech is the most natural form of communication
and
where actually saying at
speech and multi modality despite
their prevalence of smart phones it's not disappear
and
many of their
internet giants are
investing
heavily
speech technology
any questions
Audience: Someone might mistakenly get the impression, from the part where you quoted the comparatively low error rates for the machine on phones, that there's nothing to be done on the acoustic side. I don't think you think that — you had the other bullet about noise and reverberation, where I think the machines probably fail much faster than people do.
Well, as I said, there's still that fifteen percent — no, more than that, because I would expect that if we did the experiment at a much higher noise level the gap would be wider. So there is still that fifteen percent; but also, if you noticed, one of the humans actually did twice as well as the machine, and there's no reason to assume that the machines can't do that well too. So yes, there is plenty of room for improvement. As a matter of fact, there's no reason to assume that machines can't do better than humans: there are many tasks, specifically speaker verification, where machines are more capable than humans. As far as noise is concerned, I would love to run in noise the same experiment that was run for clean speech, because I think human phonetic recognition in noise will drop way down, just like the machine's. Humans use alternative strategies to transcribe speech; they don't just use the phone set, they have a lot more knowledge, which is in the language model, the syntax, the semantics. So yes, there is plenty of room to do research in acoustics, but the other parts are really lagging: we have been stuck with n-gram models.
Okay, so we have moved on for translation — I don't know about for transcription — but with n-gram models the space becomes extremely flat. I always like to use the same example: if I have a bunch of words followed by the word "dog", followed by a lot of words, followed by the word "chew", followed by a lot of words, followed by the word "bone", then "dog… chew… bone" is much more compelling than the adjacent words, the "really hairy black" or "while sitting outside", you know. So yes, although many of my colleagues have assured me that it has been tried, I think it should be tried again: try to find something better than what we are using now.
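To make the dog/chew/bone point concrete, here is a toy sketch that scores long-distance word pairs by pointwise mutual information over whole sentences — one simple way to capture triggers beyond the n-gram window. The corpus and the scoring choice are mine, purely for illustration.

```python
# Score any-distance word co-occurrence with pointwise mutual information.

import math
from collections import Counter
from itertools import combinations

corpus = [
    "the dog likes to chew on a bone".split(),
    "my dog will chew any bone he finds".split(),
    "the really hairy black cat sat outside".split(),
]

unigrams, pairs = Counter(), Counter()
for sent in corpus:
    words = set(sent)
    unigrams.update(words)
    pairs.update(combinations(sorted(words), 2))   # co-occurrence at any distance

def pmi(w1: str, w2: str) -> float:
    n = len(corpus)
    p_joint = pairs[tuple(sorted((w1, w2)))] / n
    return math.log(p_joint / (unigrams[w1] / n * unigrams[w2] / n))

print(pmi("dog", "bone"))   # > 0: seeing 'dog' makes 'bone' much more likely
print(pmi("dog", "the"))    # < 0: 'the' tells you nothing about 'dog'
```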
Audience: Most of us here have witnessed maybe one R&D cycle in speech technologies, but you have probably witnessed a whole bunch of these cycles. Is there something that surprised you in that time, something that you were basically not expecting?
Okay. I would say that in that sense nothing surprised me, but I think the technology is continuing on an upward trend in all its aspects — the language side as well as the transcription. The cycles are very long: once in a while you get a breakthrough, an inflection point, and the rest of the time the improvements are incremental. I don't know — whenever I discuss this nobody seems to recall it — but somebody gave an invited talk, at ICASSP or Interspeech, I don't remember which one, but it was in Hawaii, where he lamented the fact that speech recognition improvements ended in 1985 and that since then all the effort has been in applications. I don't really buy that observation, but progress is very slow: we are nowhere near the ability to transcribe unrestricted speech, in all genres, or to be able to understand it. And that is why my talk basically consisted of doable applications.