Okay, first I wanted to thank the committee for inviting me here today. When I was writing the speech I had a lot of fun, because I could go back to the old days. I'll cover a lot of history, and I have examples from the present as well, but I think we can learn a lot from past ventures, mistakes and so on. There is also a very short cycle in research, something like eighteen to twenty years, after which everybody forgets what was done before, so it's always nice to review. I'd like to give some acknowledgments before I start: many people were involved in the work that I will describe, but I owe one special acknowledgment to my colleague, whose expertise, knowledge and imagination led to a lot of this work. So with all that, I will proceed with my talk.
The topic is ASRU, and we are now in the third day of ASRU. For two days we have had lots of talks about ASR, but the U has been missing, so I will try to somehow fill that gap. ASRU is a branch of a larger family of applications which is usually referred to as natural language processing.
Natural language processing deals with a variety of inputs. To most people, keyed or typed input would seem to be the simplest: it does not require transcription, and for most languages you have things like word boundaries and punctuation. Although when you're typing you may not add the punctuation, when you hit return, or something like that, that's the end of your request. But it has certain problems, like homographs, which are problems that occur when you're trying to get a meaning representation from the input. I wrote here "hardcopy", but what I mean is handwritten input. It shares a lot of the difficulties of typed input, but it has the added difficulty that it requires transcription. It's not as bad as when you're dealing with true hardcopy, because online you have access to the strokes, and consequently you probably get a lot fewer errors, but it is still challenging.
Speech, in a sense, shares the same properties that handwritten input has, but on top of that, with speech we also have the problem of deciding where things like word boundaries are. Speech does have one feature, though, that is not common to the first two, which is prosody, and in these particular systems prosody is, in my opinion, extremely important. When you are transcribing speech just for transcription's sake it really doesn't matter, but when you are trying to infer the intended meaning, prosody may or may not play a role.
Let me give an example. Take a simple question like "Is this a book?" Depending on whether you stress the word "this" or the word "book", it still remains somewhat ambiguous, but if you stress "this", the response might be "No, that is a book", whereas if you stress the word "book", the response would be "No, that is a magazine", or whatever. That ambiguity is not resolvable from text alone, especially in a dialogue situation.
Moving on: I have replaced the meaning representation with applications, which will result in either actions or verbal responses. I have taken the liberty of defining three separate application classes. These are for my convenience in this talk; they are by no means meant to be a rule, and there is going to be some overlap between some of these applications. But I will discuss these different applications as I go through the talk, and we'll see what we can see. From now on I will take ASRU to mean that we are going to have speech input and data output, and these applications will have to rely on a dialogue system. On the next slide I have a chart of an example dialogue system, which I can use to explain.
Basically, since I used to work for the telephone company, the input here is telephone, but it could be simply a microphone input. The next stage is the transcription task, speech to text, and customarily a large-vocabulary continuous speech recognizer would be used. In the following stage we try to extract meaning, and the meaning may be application-driven or it may be totally unrestricted. The second is not within reach today, because it requires full semantic interpretation, but within an application environment we can talk about application-driven semantic rules, and when I say rules here I do not necessarily mean manually constructed rules. Finally we get to the dialogue manager, which has to make a decision: if a response is required, or an error is detected, a query goes back to the user; and if an action is necessary, then an action will be invoked.
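To make the chart concrete, here is a minimal sketch of those stages; every name in it — the stage functions, the canned outputs, the decision strings — is my own invention for illustration, not part of any deployed system.

```python
# Illustrative skeleton of the dialogue system on the slide:
# input -> transcription -> meaning extraction -> dialogue manager.

def transcribe(audio: bytes) -> str:
    """Stand-in for a large-vocabulary continuous speech recognizer."""
    return "i would like a used car loan"          # canned output for the sketch

def extract_meaning(text: str) -> dict:
    """Application-driven semantic rules (hand-built or learned)."""
    return {"intent": "loan", "type": "used_car"} if "loan" in text else {}

def dialogue_manager(meaning: dict) -> str:
    """Decide whether to act, respond, or query the user on error."""
    if not meaning:
        return "query_user"        # error detected: ask a clarifying question
    if meaning.get("intent"):
        return "invoke_action"     # hand the request to the application
    return "respond"

print(dialogue_manager(extract_meaning(transcribe(b"..."))))  # invoke_action
```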
Now I'd like to spend a couple of minutes on the language analyzer portion of this, and again I will make a few suggestions, but by no means should these be thought of as all-encompassing. The simplest method is to use keyword or phrase spotting. This is a mature technology which is very robust to ASR errors. It is manually configured, but it is easy to change an application by simply adding content to it, although it does require an expert to design.
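As a toy illustration of phrase spotting, here is a sketch; the phrase table is invented and far smaller than anything deployed.

```python
# Toy phrase-spotting router: scan the utterance for known phrases and
# ignore everything else, which is what makes the method robust to ASR errors.

ROUTES = {
    "collect": "collect_call",
    "calling card": "card_call",
    "operator": "human_operator",
}

def spot(transcript: str) -> str | None:
    """Return the destination for the first known phrase in the utterance."""
    text = transcript.lower()
    for phrase, destination in ROUTES.items():
        if phrase in text:
            return destination
    return None   # nothing spotted: re-prompt or hand off to a human

print(spot("uh I want to make a collect call please"))   # collect_call
```

Changing the application is then just a matter of editing the table, which is the ease-of-change point above.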
Next is what most people refer to as statistical methods. I don't like that name, because "statistical methods" also refers to other things, like parsing, so I use the term machine learning from parallel corpora. Here you have speech on one side and the resulting actions on the other side, and you can map one to the other, pretty much the way speech translation systems do. This is of course fully automatic, but you do need to obtain data; in many applications that data would be very easy to acquire. The main drawback is that if you want to change or add something to your application, you need to do additional training.
Syntactic analysis would be very good for some applications. It is not as robust as some of the other technologies, but if it can be trained on the specific genre or topic of the application, then the analysis can become very robust. Again, it is quite easy to change or extend applications, and it is also helpful in conjunction with ASR for error detection and localization. Shallow semantics just contributes additional information when necessary, for the arguments themselves — that is predicate-argument analysis — and it is very important for queries, which I'll discuss later in my talk. And finally there is deep semantics, which I will not discuss, because it really is not ready for prime time.
So I will start by discussing call center applications. This is something that we worked on in the late nineties, when Lucent was very involved in small business switching units. The business is huge, so it's commercially extremely viable — of course it is much larger today, but an estimated eighty billion dollars a year was the figure quoted in the nineties. Given that, an application does not even have to replace a human operator: just cutting a human operator's time could result in tremendous savings.
Now let me turn to probably the first successfully deployed ASRU application, which was the AT&T operator system. This was a simple application with natural language input, but of course what it used was not natural language analysis; it was word or phrase spotting for only five phrases — right offhand I don't remember all five. But it was deployed by AT&T, which at that time was the largest corporation in the world, with thousands of operators, so just cutting a few seconds off each operator call saved the company approximately three hundred million dollars a year.
Going back, I have a list here of applications for a call center. Call routing and form filling I will discuss in greater detail. Unrestricted interactions — which would be something like accessing by voice the complete website of a store or business — are something that will come up in the later discussion, since in such applications you are not limited by the ASR capabilities but by the NLP capabilities. So I will not discuss them much except in my conclusion.
So let's start with a typical call router for a call center; we actually implemented the one shown here around the turn of the century. The opening prompt was a very open question, and there was a routing matrix with confidence scoring, as well as a destination threshold. If everything was met, the call was routed. If either of those failed, the system had the option of re-asking the question, sending the call to an operator — probably after a retrial — or requesting the user to rephrase the request. But there was one other branch to this dialogue system, which was taken when we encountered multiple destinations; multiple destinations I will explain on the next slide.
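A hedged sketch of that decision logic, with invented scores and thresholds, might look like this:

```python
# Toy routing decision: route only when the best destination is confident
# AND clearly separated from the runner-up; otherwise fall back.

def decide(scores: dict[str, float], min_conf: float = 0.6,
           margin: float = 0.2, attempts: int = 0) -> str:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (dest, best), (_, runner_up) = ranked[0], ranked[1]
    if best >= min_conf and best - runner_up >= margin:
        return f"route:{dest}"
    if best >= min_conf:
        return "disambiguation_dialogue"   # several destinations look plausible
    if attempts == 0:
        return "reprompt"                  # ask the user to rephrase once
    return "operator"                      # escalate after a failed retry

print(decide({"loans": 0.85, "insurance": 0.30}))   # route:loans
print(decide({"loans": 0.45, "insurance": 0.40}))   # reprompt
```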
This was evaluated with a bank and an insurance company, with forty routing destinations, and at that time, despite the fact that the ASR was not at the level it is today, we got ninety-six percent routing accuracy, which means the false alarm rate was only about four percent. Eight percent of the calls went to an operator, but we did not keep statistics on how many of those were legitimate routes, because the request was totally out of domain, and how many were actual misses.
The disambiguation dialogue serves two purposes: first, the customer may not know the exact structure of the routing tree, and second, it lets us combine certain classes so that we get better separation and more successful routing. So if the user says "I'm looking for a used car loan", there is only one branch that satisfies the criterion. But the user may say just "loan", or "truck loan", where "truck" is not one of the words in the vocabulary; then the machine would get them into "loan" and start a dialogue.
Here is an example with the loan task. One of the user options is the so-called home or personal loan. Once the user has said "loan", we go to that branch, and because there are only two options, the system asks "Is this for an existing loan?"; when the user signals that it is an existing one, the call is routed successfully.
The underlying technology for this was word or phrase spotting, which was easy to configure. It did require linguistic expertise, but it was extremely accurate, especially when the routing destinations were well defined, and it was easy to adapt to a new application. The second alternative for this would again be to train from parallel corpora, which in my opinion is a slight overkill, although analysis of the data would provide the lexicon, which could then be used for keyword or phrase spotting.
During call handling there is often the need for verification, or authentication, of the user.
This is sort of an aside, but I wanted to show you a really easy-to-enroll system for authentication, because customers will customarily have called in several times, so you can get their voice. We start with a caller asking for an account by number, login or whatever, and if the account does not exist the call goes to an agent. If the account does exist, then we look at the user models, and if the caller is authenticated by voice, the system can choose — not necessarily, but it may choose — to add that information to the customer data for adaptation. If, however, the voice authentication failed, we go to another form of authentication, which would be something like customer challenge questions; if they are answered correctly, the user is again authenticated, and their speech is sent to the database for training, so that the next time they would be automatically verified. If that failed as well, we go to a human operator. So this is an extremely easy-to-implement, easy-to-use paradigm for authentication.
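In code, the flow on that slide might look something like the following sketch; the in-memory store and the one-line matching rule are toy stand-ins, not a real speaker-verification API.

```python
# Toy version of the enrolment/verification loop described above.

voice_models: dict[str, list[bytes]] = {"12345": []}   # accounts we know about

def voice_matches(account: str, speech: bytes) -> bool:
    return bool(voice_models[account])     # toy rule: enrolled means verified

def handle_call(account: str, speech: bytes, answers_ok: bool) -> str:
    if account not in voice_models:
        return "agent"                          # unknown account: human agent
    if voice_matches(account, speech):
        voice_models[account].append(speech)    # optional model adaptation
        return "authenticated"
    if answers_ok:                              # challenge questions passed
        voice_models[account].append(speech)    # enrol for next time
        return "authenticated"
    return "operator"                           # second failure: human operator

print(handle_call("12345", b"hi", answers_ok=True))   # authenticated (now enrolled)
print(handle_call("12345", b"hi", answers_ok=False))  # authenticated (by voice)
```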
The next application I call the form-filling application. It covers many types of applications, such as travel reservations, appointments, and many simple transactions, which could be bank transactions or store transactions. In these types of applications there are many fields that have to be filled in order to be able to execute the request.
I have taken the liberty of writing out the script of what I generally go through when I want to find out whether my train is running on time, and this is more or less the state of the art in deployed form-filling applications today. As you can see, it's a very strenuous process. The present technology is one where the computer initiates the dialogue. It is well designed for confirmation and does a fairly good job of error detection, but it's not really an example of ASRU, and not really the state of the art of the technology; it's just what is available out there today.
By contrast — and this has nothing to do with me, although it is DARPA — DARPA ran a program on exactly this many years ago, and theirs was really a state-of-the-art system, using mixed-initiative dialogue, able to fill many of the entries in the form from a single utterance, with good error detection and clarification dialogue.
The application that I showed before would be much better if it looked like this, where you can say something like "What time does the train from New York arrive?", and since you didn't say what the date was, the machine simply knows that it is missing from the form and asks you for that feature.
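Here is a minimal sketch of that slot-filling behavior; the slot list and the toy pattern extractor are mine, purely for illustration.

```python
# Toy mixed-initiative form filling: one utterance can fill several slots,
# and the system asks only for whatever is still missing.

import re

SLOTS = ["origin", "destination", "date"]

def extract(utterance: str) -> dict:
    """Toy extractor for 'from X to Y ... [on DATE]' patterns."""
    slots = {}
    if m := re.search(r"from (.+?) to (.+?)(?=\s+(?:on|arrive)\b|$)", utterance):
        slots["origin"], slots["destination"] = m.group(1), m.group(2)
    if m := re.search(r"\bon (\w+)$", utterance):
        slots["date"] = m.group(1)
    return slots

form = extract("what time does the train from new york to boston arrive")
missing = [s for s in SLOTS if s not in form]   # -> ['date']
if missing:
    print(f"For what {missing[0]}?")   # the system asks only for the date
```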
Again, if we look at the underlying technology, my opinion is that this is best served by syntactic analysis. Shallow semantics is a possibility, but not necessary for most of these applications. So it would be easy to implement, as long as you have a fairly robust analysis of the syntax, and it may also help the ASR. The machine learning paradigm would be difficult to generalize to other applications, but could be usable given enough training data. Keyword or phrase spotting, however, would not be a satisfactory solution, because you'd have too many keywords in each phrase uttered.
Okay, I have the signal, so I'm going to change pace now and go to speech translation applications. Before continuing, I'd like to play a very short segment of videotape. I know that you will recognize at least one culprit in the video, and many of you will probably recognize the setting. "Hi, I'd like to buy pesetas." [The system responds in Spanish and asks for identification.] "Here's my passport." "What is the exchange rate between US dollars and pesetas?"
So this, to my knowledge, was the first bilingual dialogue, or speech-to-speech translation, paradigm. That has been disputed, and apparently CMU claims they did this first; I'm not sure whether that's right, because when this was implemented there was no ASR system that ran in real time on a computer, and Bell Labs of course built special hardware, consisting of twelve DSP modules running in parallel, to be able to do the ASR in more or less real time — perhaps slightly slower. But it was an accomplishment in that sense.
The system consisted of a speech recognizer with a specific grammar for the application, a bilingual parser, a bilingual translator — not really a translator in today's sense, but it was a bilingual translator — and two text-to-speech modules, which produced the speech output. It is probably better to describe the system by its size: I can't say exactly what was involved, but I think it was around four hundred words, the keywords in each of the languages, and of course the translation was quite straightforward, since you knew what the words were. Today's bilingual human-machine dialogue is quite different, and the underlying technology has been replaced by generalized — today, statistical — machine translation.
Okay, present applications are quite good: first for single-turn restricted-domain applications, and, while not as robust, still extremely good for unrestricted single-turn dialogue. But single-turn translation is not accurate enough for multi-turn dialogues. I think we're all familiar — or maybe not — with the telephone game, where you say something to your neighbor and it continues along the line until it has no resemblance to what the message was originally. And of course that is what will happen here, since the two conversants do not understand each other's language. So there is a clear need for clarification and disambiguation, which would result in a human-machine dialogue before the translation happens. And there is also a need to understand context, coreference and so on, in order to be able to succeed with a multi-turn freeform conversation.
Moving on to command and control, I will describe three applications: personal agents, computer user interface by voice, and robot control.
This is another project — the last project that we did before our group at Bell Labs closed its doors — which was a personal agent. In those days, and this was back in 2001, the world was quite different: I don't think we foresaw the prevalence of smartphones, and in those days mobile phones were used strictly for voice, so this type of application was extremely necessary. It consisted of a variety of branches; we did not get to do too many of them, but we did manage to implement the functions for remote reading and writing of email. So it was partially implemented at Bell Labs in 2001, with full dialogue capabilities.
was it will dialogue capabilities
the advantage for this system was that it could
quality and
a lexicon depending on the task
so for example if you're given a day
that you're interested in an email you could collect all the nine
and subjects for that they so the one who pro
so that they to see
and have an email remotely right to down
there was a error detection
and clarification dialogue
but in addition
there was a test task dependent
what men
so this system did not need any startup training
there were quite a few other systems of this nature at that time and they
also for the mice because they required by to have our training
and very few customers for willing to spend time
this is not important
less than that i will touch and lighter in my conclusion
Let's talk about the computer voice interface. It was originally conceived as a telephony interface, because if you wanted to probe your computer remotely there was no other way to do it; as I said, that need has disappeared with the emergence of smartphones. But it does contribute to ease of use, and it especially caters to the handicapped — the mouse, in this case, is not needed. I mentioned it under multimodal use: of course one could also use gestures, and eye tracking if your computer is equipped for it, and that does enhance the interaction. So, for example, if you're working in an Excel sheet, instead of having to write the formulas you could simply verbalize without the mouse by saying "average column three", or with the mouse simply point to the column — or point with your finger — and say "average this column".
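As a toy illustration of that kind of multimodal resolution — combining a spoken command with a pointing event — here is a sketch; the command and event formats are invented.

```python
# Resolve a deictic voice command against the latest pointing event.

def resolve(command: str, pointed_at: str | None) -> str:
    """Replace 'this column' with whatever the user last pointed at."""
    if pointed_at and "this column" in command:
        return command.replace("this column", pointed_at)
    return command

print(resolve("average this column", pointed_at="column C"))
# -> "average column C"
```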
And finally, robotic command and control. Nelson showed us a toy earlier. A few weeks ago I was visiting my granddaughter, and she actually has this toy. It is not, I think, a voice-responsive toy; it's actually trained by the child, and it does all sorts of things like sit and come, and as you can see, my granddaughter loves it. I'm sure many of you have seen the robot Wall-E, which was a garbage-collecting thing — a robotic device, but not voice-controlled. The next one is a device used by the military to explore spaces and defuse bombs. Generally it is not operated by voice but activated by joystick; but if the soldiers do not have time to wait for it to explore the space before they enter, voice control would certainly help.
And finally, this is a program run by DARPA with the strange name BigDog. I don't know why it's called BigDog — "big mule" would probably be better, because it's meant to carry a lot of provisions so that the soldiers are not loaded down with the weight. This particular device can certainly use voice control, because it accompanies the soldier, and the soldier needs to remain hands-free and eyes-free to be able to operate.
So voice control of robots is fun and entertaining, and extremely useful for both commercial and military purposes. BigDog, as I showed before, is a companion to a soldier, and it's the perfect setup for multimodal communication, because when you have to stay hands-free and eyes-free, it certainly is more natural to tell BigDog "go there" and point, or have it follow your gaze. The other thing that I have added here is that the reverse could be useful too: the robot itself could use gesture as a direction finder.
So now I would like to address what I think is necessary for the future. Obviously, for ASR we still have a problem with robustness to noise and channel conditions; I believe that is being worked on. But there is an equally nagging problem in language modeling which prevents the technology from being robust across topics in general: very often we train with lots of data for a specific domain, and when we switch to a different genre the accuracy falls very drastically. So I do believe that we need to spend a lot of effort researching language models.
I had the luxury, a few years ago, of having an experiment done, because I was curious as to how computer phonetic transcription relates to human phonetic transcription. Most people believe that humans are extremely adept at phonetic transcription, and I believe that is because many of the experiments in phonetic transcription have been done in artificial settings, where the results come out much higher than they should be.
So we ran an experiment where we asked human transcribers to transcribe speech naturally, except that they had no lexical, semantic or even phonotactic information. To do that, you choose two languages with an extremely similar phoneme set, have one set of native speakers speak one language, and have another set of native speakers transcribe it in their own language, as best they can.
The experiment was actually carried out with an additional language as well, but I will show the results for the first two languages, which were Japanese and Italian; they have a tremendous overlap in phonemes. As you can see here, the ASR had a 34.9 percent phone error rate, the average human had 29.9, and the best human had 17.2, which far exceeded the machine. And humans have no trouble understanding speech even at 37.5 percent phone error rate.
The experiment was also done using Spanish and Italian, and of course there is quite a bit of phonotactic overlap and some lexical overlap between those, so the results for Spanish-Italian were much higher. But the point is that when you are deprived of any kind of language model or phonotactic model, the machines are doing almost as well as the average human; there is really room here for about fifty percent relative improvement, which is roughly the gap to the best human. I might add that the recognizer used here was not the latest neural net recognizer, and we're beginning to see those give fifteen percent relative improvement. So maybe soon the machines will match the human ability to transcribe.
Moving on: people always talk about prosodic analysis in ASR, but so far there has been very little research. It's not important for transcription, or for one-way translation, but it's extremely important for dialogue, because intent does drive the dialogue.
Those of you who have known me in the past are probably wondering why I haven't said much about text-to-speech so far. That technology has really taken a turn — in some respects for the better, but in many respects for the worse. It sounds a lot more natural than it did in the nineties, because of the HMM models and other large-vocabulary, large-data synthesis, but prosody has pretty much disappeared from text-to-speech. Again, that may not be important if you're expecting a one-sentence response, but if you're trying to listen to a full paragraph, I guarantee that you will not have much comprehension. The text-to-speech community still does quality evaluations, but as far as I know they don't do much comprehension evaluation — maybe I haven't kept up with the community, so I'm not sure — but I think it would be good to do an experiment which we actually did years ago: present a very large, complex paragraph via text-to-speech, then ask college-exam-like multiple-choice questions and see how much is retained.
For these applications, error detection and error localization are extremely important (my computer had problems here), and we need dialogue for error recovery. Also, dialogue for help menus is extremely important to facilitate applications. And finally, joint optimization between the ASR and the application quite often reduces the error rate of the application, even if it may increase the word error rate of the ASR. We have seen that repeatedly, in various programs where we fed either transcriptions of speech or transcriptions of handwriting into translation: joint optimization actually helps.
What can we do as a community about the problems that are preventing certain applications from becoming deployable? There has to be a lot more work in Q&A and in information retrieval. There has been work on that, but I don't believe that the accuracy is such that it would satisfy the kind of customers that would call in. It does have a lot of value in more analysis-type work, but we have to have a very low false alarm rate and a lot more detection before we can actually do Q&A.
And I know that we all use the giant of information retrieval, and it does have a hundred percent recall, but it also has close to zero percent precision, and one should not expect customers to accept responses with zero percent precision. We actually faced this in one aspect of GALE, what we called distillation, which required very different responses: targeted answers, where it was important who did what to whom. And we had one such example — a query where it mattered who went down and who came up — where the first fifty responses by Google were all the reverse, while the GALE distillation was actually able to pick the right one. Still, I think that there's a lot more work to be done.
Again, there should be a lot more work done in unrestricted bilingual dialogue. One of the things that prevents this technology from going forward is that there is a need for platforms that learn. I haven't done this as an experiment, but imagine a dialogue system, whether with your robot or with your desktop, where whenever it encounters an OOV you can explain that word and have it retain it, and whenever it encounters a construction that it does not understand, it comes back with a clarification dialogue and you can explain that too. Eventually such systems would become smarter and better. Systems should also eventually be configured to be able to do planning and inference. And finally, just before I left, I started a program in grounded language acquisition, aiming at full AI semantics. It did not get very far, but I do believe that there is room to do a lot of research in this area.
With my final slide I'd like to talk a little about the choice of applications. When designing an application, you have to think about the customers' trust in it. Applications with too many false alarms — that is, a router with too many bad routes — engender lack of trust by the customer. The number of misses is not as crucial, though it is application dependent, because you can always have a soft fallback when you miss an action. It is also important to reduce the cost of enrollment and the cost of learning a specific application, which is usually done by the machine itself detecting errors and correcting them. It is important to design compelling applications: some applications may be easy to implement, but unless they fill an urgent need, they will most likely fail. It is also always wise to ensure that your application is compelling relative to alternative ways of accomplishing the task; if there is an easier alternative, again, the application will disappear.
finally i'd like to and this on a real positive know which is good news
and you're all for a
bill gates this and that speech is the most natural form of communication
and
where actually saying at
speech and multi modality despite
their prevalence of smart phones it's not disappear
and
many of their
internet giants are
investing
heavily
speech technology
any questions
Audience: Someone might mistakenly get the impression, from the part where you quoted the comparatively low error rates for the machine on phones, that there's nothing to be done on the acoustic side. I don't think you think that — you had the other bullet about noise and reverberation, where I think the machines probably fail much faster than people do.
Well, as I said, there's still that fifteen percent — no, more than that, because I would expect that if we did the experiment at a much higher noise level the gap would be wider. So there is still that fifteen percent; but also, if you noticed, one of the humans actually did twice as well as the machine, and there's no reason to assume that the machines can't do that well too. So yes, there is plenty of room for improvement. As a matter of fact, there's no reason to assume that machines can't do better than humans: there are many tasks, specifically speaker verification, where machines are more capable than humans. As far as noise is concerned, I would love to run in noise the same experiment that was run for clean speech, because I think human phonetic recognition in noise will drop way down, just like the machine's. Humans use alternative strategies to transcribe speech; they don't just use the phone set, they have a lot more knowledge, which is in the language model, the syntax, the semantics. So yes, there is plenty of room to do research in acoustics, but the other parts are really lagging: we have been stuck with n-gram models.
Okay, so we have moved on for translation — I don't know about for transcription — but with n-gram models the space becomes extremely flat. I always like to use the same example: if I have a bunch of words followed by the word "dog", followed by a lot of words, followed by the word "chew", followed by a lot of words, followed by the word "bone", then "dog… chew… bone" is much more compelling than the adjacent words, the "really hairy black" or "while sitting outside", you know. So yes, although many of my colleagues have assured me that it has been tried, I think it should be tried again: try to find something better than what we are using now.
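To make the dog/chew/bone point concrete, here is a toy sketch that scores long-distance word pairs by pointwise mutual information over whole sentences — one simple way to capture triggers beyond the n-gram window. The corpus and the scoring choice are mine, purely for illustration.

```python
# Score any-distance word co-occurrence with pointwise mutual information.

import math
from collections import Counter
from itertools import combinations

corpus = [
    "the dog likes to chew on a bone".split(),
    "my dog will chew any bone he finds".split(),
    "the really hairy black cat sat outside".split(),
]

unigrams, pairs = Counter(), Counter()
for sent in corpus:
    words = set(sent)
    unigrams.update(words)
    pairs.update(combinations(sorted(words), 2))   # co-occurrence at any distance

def pmi(w1: str, w2: str) -> float:
    n = len(corpus)
    p_joint = pairs[tuple(sorted((w1, w2)))] / n
    return math.log(p_joint / (unigrams[w1] / n * unigrams[w2] / n))

print(pmi("dog", "bone"))   # > 0: seeing 'dog' makes 'bone' much more likely
print(pmi("dog", "the"))    # < 0: 'the' tells you nothing about 'dog'
```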
Audience: Most of us here have witnessed maybe one R&D cycle in speech technologies, but you have probably witnessed a whole bunch of these cycles. Is there something that surprised you in that time, something that you were basically not expecting?
Okay. I would say that in that sense nothing surprised me, but I think the technology is continuing on an upward trend in all its aspects — the language side as well as the transcription. The cycles are very long: once in a while you get a breakthrough, an inflection point, and the rest of the time the improvements are incremental. I don't know — whenever I discuss this nobody seems to recall it — but somebody gave an invited talk, at ICASSP or Interspeech, I don't remember which one, but it was in Hawaii, where he lamented the fact that speech recognition improvements ended in 1985 and that since then all the effort has been in applications. I don't really buy that observation, but progress is very slow: we are nowhere near the ability to transcribe unrestricted speech, in all genres, or to be able to understand it. And that is why my talk basically consisted of doable applications.