0:00:16okay first i wanted to thank the committee for inviting me here
0:00:22today
0:00:24when i was preparing this speech i had a lot of fun because it took me
0:00:27back to the old days and
0:00:30i'll cover a lot of history and i have precedent from now but i think
0:00:36we can learn a lot from
0:00:38a
0:00:40past ventures mistakes and so on
0:00:45a
0:00:47there's also
0:00:49a very short cycle in research somewhere between eighteen and twenty years everybody forgets
0:00:55what was done before so it's always
0:00:58nice to review
0:01:03i'd like to
0:01:05give some acknowledgements before i start many people were involved in
0:01:11the work that i will describe
0:01:14but i want one
0:01:16special
0:01:17acknowledgment for my cold calling generally
0:01:21whose expertise
0:01:23knowledge and imagination led to a lot of this work
0:01:30so with all that i will proceed with my talk
0:01:34and the workshop is asru and we're now in the third day of
0:01:41asru and so for the first two days we have had lots of talks about
0:01:46asr
0:01:48but the u has been missing
0:01:51so i'm trying to somehow fill that
0:01:56and
0:01:57so asru is a branch of
0:02:00a larger family of
0:02:03applications which usually is referred to as natural language processing
0:02:10in the
0:02:11natural language processing
0:02:14usually consists of variety of inputs
0:02:19to most people unicode or typed input
0:02:24would seem to be the simplest
0:02:27it does not require transcription and for
0:02:31most languages you have things like word boundaries and punctuation although when you're typing you
0:02:37may not type punctuation but
0:02:39when you hit return you're signalling that that's the end of your request
0:02:44but it has certain problems like homographs
0:02:50polysemous words and other problems that occur
0:02:54when you're trying to get any representation from
0:02:58the input
0:03:01i wrote here hardcopy but what i meant was handwritten input
0:03:07and
0:03:09it shares a lot of the difficulties with typed input
0:03:14but it has the added difficulty that
0:03:16it requires transcription
0:03:19it's not as bad as when you're dealing with hardcopy because it's online and you
0:03:23can track
0:03:26the stroke and consequently
0:03:29probably get a lot fewer errors but it's still challenging
0:03:33speech
0:03:35in a sense shares the same difficult properties that
0:03:38handwritten input shares
0:03:41and also of course in speech we do have the
0:03:45problem of deciding
0:03:47things like word boundaries
0:03:50but speech does have a single feature that is not common to the
0:03:54first two
0:03:55which is prosody
0:03:57which in these particular systems in my opinion is extremely important
0:04:05when you're trying to transcribe
0:04:07speech just for transcription's sake
0:04:11it really doesn't matter
0:04:13but when you're trying to infer the intended meaning
0:04:17prosody may or may not play a role
0:04:20and by way of example
0:04:23take
0:04:25just a simple
0:04:27question like
0:04:28is this a book
0:04:30depending on
0:04:32whether you stress the word this or book
0:04:35it still remains somewhat ambiguous but you're assuming
0:04:39that the response
0:04:40would be no that is a book
0:04:43but when you stress the word book
0:04:45the response would be no that is a magazine or whatever
0:04:50so
0:04:52that ambiguity is not resolvable from text alone especially in
0:04:59a dialogue situation
0:05:02so going on
0:05:06i replaced
0:05:07the meaning representation with applications
0:05:11which will result in actions or probably responses
0:05:17and
0:05:19taking actually there
0:05:21i've taken the liberty of making
0:05:24three separate
0:05:27application classes
0:05:29these are for my convenience for the talk they're by no means
0:05:34meant to be
0:05:35a rule and there is going to be some overlap between
0:05:41some of these applications
0:05:43but i will discuss
0:05:45these different applications later in the talk
0:05:50but a bit what we can see
0:05:54and from now on i will talk about asru which means that we are going
0:05:58to have speech input
0:06:02and these applications will have to rely on a dialogue system
0:06:08so in the next slide i will have a chart as an example of that style of
0:06:14system
0:06:15so i can explain it
0:06:18so basically
0:06:20since i used to work for the telephone company the input is a telephone but it
0:06:25could be
0:06:26just simply a microphone input
0:06:29and
0:06:31the next stage is the transcription task
0:06:34where we go to
0:06:36text and customarily that would be
0:06:40a large vocabulary continuous speech recognizer
0:06:44in the next stage we're going to try to extract meaning
0:06:49and the meaning may be application driven or may be totally unrestricted
0:06:55the second one is not within reach these days because it requires full semantic interpretation
0:07:04but we'll talk a lot about the application environment
0:07:08semantic rules
0:07:10and when i say rules here i do not mean manually constructed rules necessarily
0:07:17finally we get to the dialogue manager that has to make a decision
0:07:22if there is a response or an error is detected
0:07:28a query will be issued
0:07:29back to the user
0:07:33and if an action is necessary then that action will be invoked
0:07:39and down
0:07:40i'd like to spend a couple of minutes saying something about the
0:07:46language analyser portion of this
0:07:49and
0:07:51again i will have a few suggestions for this but by no means these should
0:07:55be thought of as
0:07:57all encompassing
0:08:03so
0:08:04the simplest method is to use keyword or
0:08:09phrase spotting
0:08:10this is a mature technology which is very robust to asr
0:08:15errors it is manually configured
0:08:19but it is easy to change an application by simply adding
0:08:24a
0:08:25content to it
0:08:27it does require an expert to design
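A minimal sketch of keyword or phrase spotting as described above, assuming a hand-configured table that maps phrases to application labels. The table contents, the labels and the function name `spot_phrases` are hypothetical illustrations, not the system from the talk.

```python
def spot_phrases(transcript: str, phrases: dict[str, str]) -> list[str]:
    """Return the label of every configured phrase found in the transcript."""
    # Normalise whitespace and case, then pad so matches fall on word boundaries.
    padded = f" {' '.join(transcript.lower().split())} "
    return [label for phrase, label in phrases.items() if f" {phrase} " in padded]


# Hypothetical, manually configured phrase table; extending the application
# only requires adding another entry here.
PHRASES = {
    "collect": "collect_call",
    "calling card": "calling_card",
    "operator": "operator",
}

print(spot_phrases("uh i want to make a collect call please", PHRASES))
```

Because only the listed phrases are matched, recognition errors elsewhere in the utterance do not affect the result, which is one way to read the robustness claim; the cost is that someone has to choose the phrases by hand.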
0:08:31next is what most people refer to as statistical methods
0:08:36i don't like that because
0:08:39statistical methods also refer to other aspects
0:08:43like parsing
0:08:45so i use the concept of machine learning from parallel corpora
0:08:50and here you have speech on one side and the resulting actions on the other
0:08:55side and you can map them
0:08:58pretty much the way a speech translation system does
0:09:02this is of course fully automatic
0:09:06but you do need to obtain data
0:09:09in many applications that data would be very easy to acquire
0:09:14huh
0:09:15the main drawback is that if you want to change or add something to your
0:09:20application
0:09:21you need to do
0:09:23additional training
0:09:28syntactic analysis
0:09:30would be very good for some applications
0:09:34it is not as robust as some of the other technologies
0:09:39but if the application
0:09:42can be trained with a specific genre or topic
0:09:47then the
0:09:49analysis can become very robust
0:09:54and again it is
0:09:59quite
0:09:59easy to
0:10:02change or extend applications
0:10:05and it is also helpful in conjunction with asr
0:10:09for error detection and localisation
0:10:14shallow semantics
0:10:15just contributes additional information
0:10:18when necessary for
0:10:20a
0:10:21the arguments themselves and
0:10:24that's predicate argument analysis
0:10:27very important for
0:10:30queries which i'll discuss later in my talk
0:10:33and finally
0:10:35deep semantics which i will not discuss because it really is not ready for prime
0:10:40time
0:10:46so i will start by discussing call center applications
0:10:49and this is something
0:10:52that
0:10:54we did work on in the late nineties
0:10:56when lucent was very involved in
0:11:01small business switching units
0:11:06the business is huge so it's commercially extremely viable
0:11:12of course it is much larger than the estimated eighty billion dollars that
0:11:16was quoted
0:11:19in the nineties
0:11:21it does not have to replace a human operator just cutting a human
0:11:25operator's time
0:11:27could result in tremendous savings
0:11:32and down
0:11:34i now turn
0:11:36to probably the first successfully deployed asru application
0:11:42which was the at&t operator system
0:11:46for this simple application
0:11:49with natural language input of course
0:11:53it was not natural language analysis it was
0:11:57word or phrase spotting for only five words
0:12:02and
0:12:03i'm hard pressed to remember all five words
0:12:07but it was deployed by at&t
0:12:10which at that time was
0:12:12the largest corporation in the world with six hundred
0:12:17that's operators
0:12:19so just cutting a few seconds off each
0:12:23operator call could save the company approximately three hundred million dollars a year
0:12:31going back
0:12:33i have a
0:12:35list here of applications for
0:12:39a call center
0:12:43call routing and form filling i will discuss in
0:12:47great detail
0:12:50unrestricted interactions which would be something like accessing
0:12:54by voice
0:12:55a complete website of a store or business
0:13:01is something that
0:13:02will come up in later discussion
0:13:06since in effect you
0:13:09are not limited by the asr capabilities but by the nlp capabilities
0:13:16so i will not discuss it much except in my conclusion
0:13:21so if we start with a
0:13:25call router for a call centre
0:13:28we actually implemented the one below
0:13:31around the turn of the century
0:13:34a data
0:13:36the whiteboard was very question in
0:13:40and there was
0:13:42matrix routing and confidence scoring
0:13:45as well as a destination threshold
0:13:50if everything was met
0:13:52the call was routed
0:13:56if either one of those failed
0:13:59the system had an option of
0:14:02requerying
0:14:05or sending
0:14:07to an operator probably after
0:14:10a trial
0:14:11or requesting the user to
0:14:15rephrase the request
0:14:17but there was one other branch to this
0:14:20dialogue system
0:14:22which was when we encounter multiple destinations
0:14:27multiple destinations i will explain in the next slide
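The routing logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the deployed system: the function name `route_call`, the `Decision` type, the threshold values and the retry limit are all hypothetical, and the per-destination scores are assumed to come from some upstream routing matrix.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str               # "route", "disambiguate", "reprompt" or "operator"
    destinations: tuple = ()


def route_call(scores, asr_confidence, *, min_confidence=0.5,
               destination_threshold=0.6, attempts=0, max_attempts=2):
    """Pick the next dialogue action from destination scores and ASR confidence."""
    if asr_confidence < min_confidence:
        candidates = []       # recognition too unreliable to trust any score
    else:
        candidates = sorted(d for d, s in scores.items()
                            if s >= destination_threshold)
    if len(candidates) == 1:
        return Decision("route", tuple(candidates))
    if len(candidates) > 1:
        # Several destinations passed: start a disambiguation dialogue.
        return Decision("disambiguate", tuple(candidates))
    if attempts < max_attempts:
        return Decision("reprompt")   # ask the caller to rephrase
    return Decision("operator")       # give up and hand over to a human


print(route_call({"car_loan": 0.9, "home_loan": 0.2}, asr_confidence=0.8))
```

Keeping the confidence check and the destination threshold separate mirrors the description that both conditions have to be met before a call is routed.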
0:14:33this was evaluated
0:14:35with a
0:14:37bank
0:14:38and an insurance company
0:14:41with forty
0:14:43routing destinations
0:14:45and at that time
0:14:49despite the fact that the asr was not
0:14:51at the same level that it is today
0:14:53we got ninety six percent routing accuracy
0:14:57which is to say that the
0:15:01there
0:15:02false alarm rate was only about four percent
0:15:06eight percent of those calls were routed to an operator but we did not keep statistics
0:15:11on how many of those
0:15:14were legitimate routes because the request was
0:15:17totally out of domain and how many were actual failures
0:15:22so the disambiguation dialogue
0:15:26serves two purposes one
0:15:28the customer may not know
0:15:30the exact structure of
0:15:32probably
0:15:34three
0:15:36and second would be to combine certain classes so that we have better separation and more
0:15:42successful routing
0:15:44so if the user said i'm looking for a used car loan
0:15:48there will be only one branch that would satisfy that criterion
0:15:53but the user may say either alone or track one track
0:15:58not one of their
0:16:00words in the vocabulary
0:16:02then the machine would get them into loan
0:16:05and start a dialogue
0:16:07and what S
0:16:09this is
0:16:10an existing
0:16:12i'm sorry one task
0:16:14but the user's options are car home or personal
0:16:20once the user says car loan it goes to that branch
0:16:24but now because there are only two options
0:16:27the system would ask is this an existing loan and the user says it's a
0:16:32new one and
0:16:34the call is routed successfully
0:16:40the underlying technology for this
0:16:42was
0:16:44word
0:16:45or phrase spotting
0:16:46which was easy to configure
0:16:49it did require linguistic expertise
0:16:53and a
0:16:55what is extremely accurate especially when routing destinations for mine
0:17:01and it was easy to a
0:17:04adopts a new
0:17:05right
0:17:07the second alternative for this would again be to train from parallel corpora
0:17:13and
0:17:16in my opinion it's
0:17:18a slight overkill
0:17:20although analysis of the data
0:17:23would provide the lexicon which could be used for the keyword or phrase spotting
0:17:34during the call routing
0:17:38often
0:17:39there is often the need
0:17:41for
0:17:43verification
0:17:44or authentication of the user
0:17:47this is sort of an aside but i wanted to show you
0:17:51it's
0:17:53really easy
0:17:55to enrol in a system for authentication
0:17:58because customarily you would have the customer calling in many times so you can get their voice
0:18:05so we start with a caller calling with an account number login or whatever
0:18:11and if the account does not exist it would go to an agent
0:18:15but if the account does exist
0:18:17then
0:18:18we look at the user models and if it's authenticated
0:18:22then the machine can choose not necessarily but it may choose to add
0:18:28that information
0:18:30to the customer data for adaptation
0:18:33however if the authentication failed we go to formal authentication which would be posing to the
0:18:42customer challenge
0:18:44questions and if those were answered correctly that
0:18:48user would again be authenticated and their speech would be sent
0:18:53to the data for training so that the next time they would be automatically verified
0:18:59if it failed again we go to a human operator
0:19:04so this
0:19:06is an extremely easy to implement and use paradigm for authentication
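The authentication flow just described can be written down as a small decision function. This is only a sketch: the function name `authenticate`, the boolean inputs and the returned tuple are hypothetical simplifications, and the voice match and challenge questions are assumed to be evaluated elsewhere.

```python
def authenticate(account_exists: bool, voice_match: bool,
                 challenge_passed: bool) -> tuple[str, bool]:
    """Return (outcome, add_speech_to_model) for one call.

    outcome is "agent", "authenticated" or "operator"; the second value says
    whether the caller's speech may be added to their voice model.
    """
    if not account_exists:
        return ("agent", False)            # unknown account goes to an agent
    if voice_match:
        return ("authenticated", True)     # speech may be used for adaptation
    if challenge_passed:
        # Voice check failed but the challenge questions were answered
        # correctly: accept, and keep the speech for training.
        return ("authenticated", True)
    return ("operator", False)             # both checks failed


print(authenticate(account_exists=True, voice_match=False, challenge_passed=True))
```

The point of the second branch is that enrolment costs the customer nothing extra: a caller who passes the challenge questions is enrolled from the speech they were going to produce anyway.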
0:19:15the next
0:19:17application
0:19:19i call a form filling application
0:19:22and it involves many
0:19:25types of application such as travel
0:19:28reservations
0:19:30appointments
0:19:32many simple transactions
0:19:34and
0:19:35which could be bank transactions or store transactions
0:19:40and in these types of application
0:19:43there are many fields to be filled
0:19:46in order to be able to execute
0:19:49the request
0:19:52i have taken the liberty of
0:19:54writing out the script
0:19:56of what
0:19:57i generally use when i want to find out whether my train is running on
0:20:01time
0:20:03and
0:20:04this is more or less the state-of-the-art
0:20:08for form filling applications today
0:20:14as you can see
0:20:16it's a
0:20:17very strenuous process
0:20:21so the present
0:20:23technology is the one way computer initiated dialogue
0:20:28it is well designed for confirmation and does a fairly good job of error detection
0:20:34but it's not really an example of asr you
0:20:38and not really the state-of-the-art in the technology
0:20:41it's just what is available out there today
0:20:47by contrast
0:20:50this has nothing to do with me although it is darpa
0:20:53darpa did run the program called atis
0:20:57many years ago
0:20:59and that was really a state-of-the-art program
0:21:02using mixed initiative dialogue
0:21:06being able to fill many of the entries in the form
0:21:10with a single utterance with good error detection
0:21:14and clarification dialogue
0:21:17and
0:21:18the application that i showed before would be much better if it looked like this
0:21:23where you can say something like what time is the train from new york right in
0:21:28front of well one
0:21:31and since you didn't say what the date was the machine would simply know that
0:21:36was missing in the form
0:21:38and ask you for that field
0:21:43again we look into the underlying technology
0:21:48my opinion is that this is best served with
0:21:53syntactic analysis shallow semantics
0:21:56is a possibility but
0:21:59not necessary for most of these applications
0:22:03so it would be easy to implement
0:22:05as long as you have
0:22:07a fairly robust
0:22:09analysis of the syntax
0:22:12and
0:22:14it also may help
0:22:21the paradigm of machine learning
0:22:25would be difficult to generalise to other applications but could be usable with enough training
0:22:32however word
0:22:34or phrase spotting would not be a satisfactory solution because
0:22:38you'd have too many keywords in each phrase uttered
0:22:49okay
0:22:51i have the signal
0:22:57i'm going to
0:22:59change pace now and go into
0:23:03speech translation applications
0:23:06before continuing i'd like to play a very short segment of videotape
0:23:12and
0:23:13i know that you'll recognize at least one culprit in the video and many of you
0:23:18will probably recognized extracting
0:23:29how i'd like to buy pesetas
0:23:34but i
0:23:37note this adorable formally and if you kevin
0:23:41i mean my
0:23:44here's my passport
0:23:50what is the exchange rate between us dollars and pesetas
0:24:02okay so
0:24:05this finding out that is
0:24:07the first
0:24:10bilingual
0:24:11dialogue or speech-to-speech translation paradigm
0:24:16not reliable and i'm not sure whether is here today
0:24:20as disputed
0:24:23and the parents that cmu
0:24:25you know balance we first do this
0:24:30i'm not sure whether he's right on that because
0:24:34when this was implemented a
0:24:37there was no asr system the trend in real time computer
0:24:41and this the of course balanced it can start
0:24:45special hardware consisting of twelve
0:24:49dsp modules running in parallel to be able to do the asr in more
0:24:55or less real time there is slightly later
0:24:58but
0:25:00was an accomplishment in that sense
0:25:05the system
0:25:07consisted of a speech recognizer
0:25:12with a
0:25:15specific grammar for the application
0:25:19of a lingual parser
0:25:21we only bilingual translator
0:25:24not really a translator but it was bilingual translator
0:25:29two text-to-speech modules
0:25:31which
0:25:33a speech out but
0:25:36it was probably better to describe system and i can see what's actively involved but
0:25:42i think it was that
0:25:44around four hundred words
0:25:47keywords in each of the systems of course the translation was
0:25:51quite straightforward since
0:25:53you knew what the words were
0:26:01today's bilingual
0:26:04human dialogue
0:26:06is quite different
0:26:08the underlying technology has been replaced by a generalized
0:26:13so today statistical machine translation
0:26:18okay
0:26:19present applications
0:26:21are quite good
0:26:23for single turn restricted domain applications
0:26:28they're not as robust but still extremely good for unrestricted dialogue
0:26:35but the single turn
0:26:37is not accurate enough for multi turn dialogues i think we're all familiar or maybe
0:26:43not with the old
0:26:46with the old
0:26:48telephone game where you say something to your neighbour and it continues along until by the
0:26:53end
0:26:55it has no resemblance to what the message was originally
0:27:00and of course this is what will happen
0:27:04since the two conversants
0:27:06do not understand each other's language
0:27:09so there's a clear need for clarification and disambiguation
0:27:15which would result in human-machine dialogue
0:27:19before the translation happens
0:27:22and there's also a need to understand the context
0:27:26coreference and so on in order to be able to succeed with a multi turn
0:27:33freeform conversation
0:27:39we'll now turn to command and control
0:27:43i will describe
0:27:44three applications
0:27:47a
0:27:48personal agents
0:27:50computer user interface by voice and robot control
0:27:59this is another
0:28:02project
0:28:04the last project that we did before
0:28:07we some closed it's doors on bell labs
0:28:10which was a personal agent
0:28:13in those days and then it was quite different and it's to the egg rolls
0:28:20in two thousand and one i don't think
0:28:23we foresaw the
0:28:25prevalence of
0:28:27smart
0:28:29what we now call smart phones
0:28:31in those days mobile phones
0:28:34were used strictly for voice
0:28:36so this type of application was extremely necessary
0:28:41so it consisted of a variety of branches we did not get to do too
0:28:45many of them
0:28:47but we did manage to
0:28:50a
0:28:52do
0:28:53function for
0:28:54remote reading and writing of email services
0:29:00so it was partially implemented at bell labs in two thousand and one
0:29:05was it will dialogue capabilities
0:29:11the advantage for this system was that it could
0:29:14quality and
0:29:16a lexicon depending on the task
0:29:20so for example if you're given a date
0:29:23that you're interested in an email you could collect all the names
0:29:27and subjects for that they so the one who pro
0:29:31so that they to see
0:29:34and have an email remotely right to down
0:29:38there was error detection
0:29:40and clarification dialogue
0:29:43but in addition
0:29:44there was task dependent
0:29:48what men
0:29:49so this system did not need any startup training
0:29:54there were quite a few other systems of this nature at that time and they
0:30:00all met their demise because they required about half an hour of training
0:30:05and very few customers were willing to spend the time
0:30:09this is an important
0:30:10lesson that i will touch on later in my conclusion
0:30:18we talk about
0:30:20computer voice interface
0:30:25it was originally conceived as a telephony interface
0:30:30because
0:30:31if you wanted to probe your
0:30:33computer remotely
0:30:36there was no other way to do it and as i said that has disappeared due to
0:30:40the
0:30:43emergence of smart phones
0:30:47but it does contribute to ease of use
0:30:51and especially assists the handicapped
0:30:56the mouse in this case is and headed
0:30:59the mentioned for
0:31:01multimodal use
0:31:04but of course one could
0:31:06also use gestures
0:31:09and eye tracking if your computer is equipped to do that
0:31:15it does enhance the interactions
0:31:18so for example if you're working on an excel sheet
0:31:21a
0:31:22instead of having to write the formulas you could simply
0:31:26a
0:31:26verbalise
0:31:28without the mouse by saying average column three or
0:31:32with a mouse simply point
0:31:34to the column or with your finger
0:31:37and say average this column
0:31:42and finally
0:31:44robotic command and control
0:31:48okay a
0:31:50nelson showed us a at all
0:31:53but time
0:31:55few weeks ago hours the visiting my granddaughter and she actually has the story and
0:32:01this is not
0:32:02a
0:32:03i think voice response story it's actually
0:32:07training by the child and does all sorts of things like set and calm and
0:32:12you can see
0:32:14my resilience like no
0:32:22i'm sure many of your seen in the robotic
0:32:25would be wildly
0:32:27which was a garbage one thing
0:32:30the vice robotic device not voice control
0:32:36this is a device used by the military to
0:32:41explore spaces
0:32:43and defuse bombs
0:32:46generally it's not used with voice control but
0:32:50activated by joystick
0:32:53but if
0:32:54the soldiers do not have time to wait for it to explore the space before they
0:32:59would enter
0:33:00voice control would certainly help
0:33:03and finally
0:33:05this is a program run by darpa
0:33:09with the strange name of bigdog
0:33:12i don't know why it's called bigdog
0:33:15mule would probably be better because it's meant to carry
0:33:23a lot of provisions so that the soldier is not too laden with
0:33:29the weight
0:33:31and this particular device can certainly use voice control because it is accompanying the
0:33:37soldier and the soldier
0:33:39needs to remain hands-free and eyes-free to be able to operate
0:33:45so one
0:33:48it is found in torrance
0:33:50and extremely useful for both commercial
0:33:53and military purposes
0:33:57bigdog as i showed before is a companion to a soldier
0:34:02and it's the perfect setup
0:34:04for multi modal communication
0:34:07because when you have your hands and eyes free it certainly is more natural to tell
0:34:12bigdog
0:34:14go there and point to it
0:34:16or
0:34:17have it follow your gaze
0:34:20and
0:34:24on the other thing that i and added here is
0:34:29the reporters of multimodal communication
0:34:32could be found in yours
0:34:34where
0:34:35there were about itself
0:34:37wouldn't use gesture
0:34:38is direction finder
0:34:44so
0:34:46i would like now to address
0:34:50what i think
0:34:51is necessary for the future
0:34:55and
0:34:57obviously for asr
0:35:00we still have
0:35:02a problem
0:35:04with its robustness to noise
0:35:07and channel conditions
0:35:10i believe that is being worked on
0:35:16but there is
0:35:18another remaining problem
0:35:21in language modeling which prevents the technology from being robust
0:35:26to topic and genre
0:35:29very often
0:35:31we train
0:35:33with lots of data for a specific genre and when we switch to a different genre
0:35:39a
0:35:40the accuracy falls very drastically
0:35:45so i do believe that we need
0:35:48to spend a lot of effort
0:35:50researching language models
0:35:54and i had the luxury a few years ago to
0:35:59have an experiment done
0:36:02because i was curious as to
0:36:05how computer phonetic transcription relates to human phonetic transcription
0:36:13most people believe that humans are extremely adept at phonetic transcription
0:36:19and i believe that is because many of the experiments that have been done
0:36:25in transcribing
0:36:27phonetics were done in artificial settings and the results are much higher
0:36:32than
0:36:34they should be
0:36:36so
0:36:37we ran an experiment where we asked humans to transcribe speech naturally
0:36:43except that they had no
0:36:46lexical semantic or even phonotactic information
0:36:52to do that
0:36:53we chose two languages with an extremely similar phoneme set
0:36:58had one set of native speakers speak one language and had another set of native
0:37:04speakers
0:37:06transcribe that in their own language
0:37:08as best they could
0:37:11the experiment was actually carried out with
0:37:14an additional language i will show you the results for the first two languages which were
0:37:19japanese
0:37:20and italian
0:37:23which have a tremendous overlap in phonemes and as you can see here
0:37:27asr had a
0:37:31thirty four point nine percent phone error rate
0:37:34the average human had twenty nine point nine
0:37:38the best human had seventeen point two but the worst
0:37:43much exceeded the machine
0:37:46humans have no trouble understanding even thirty seven point five percent
0:37:52phone error rate
0:37:55the experiment was also done using
0:37:58spanish and italian
0:38:00and of course
0:38:02there is
0:38:04quite a bit of phonotactic overlap and some lexical overlap and the results
0:38:10for
0:38:11spanish-italian were much higher
0:38:13but when you're devoid of any kind of language models and phonotactic models
0:38:19obviously
0:38:21the machines are doing almost as well there is room here for about
0:38:26fifteen percent relative improvement
0:38:30i might add that the recognizer used here was not a neural net
0:38:34recognizer and we're beginning to see that fifteen percent relative improvement
0:38:41so
0:38:42maybe soon
0:38:43the machines will match the human ability to transcribe
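As a quick check of the arithmetic behind the figures quoted above, the gap between the machine and the average human can be expressed as a relative error reduction. The function name `relative_gap` is hypothetical; the two error rates are the ones quoted in the talk for the japanese and italian condition.

```python
def relative_gap(machine_error: float, human_error: float) -> float:
    """Relative error reduction the machine needs to match the human."""
    return (machine_error - human_error) / machine_error


# Phone error rates quoted in the talk: asr 34.9, average human 29.9.
gap = relative_gap(34.9, 29.9)
print(f"{gap:.1%}")  # 14.3%
```

That is roughly fourteen percent, which is in line with the fifteen percent relative improvement mentioned for neural net recognizers.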
0:38:51going on
0:38:53people always talk about
0:38:55prosodic analysis in asr
0:38:58but to date
0:38:59there has been very little research
0:39:03it's not important
0:39:05for transcription
0:39:07or one way translation
0:39:10but it's extremely important for dialogue because
0:39:14intent
0:39:15does drive the dialogue
0:39:21those of you who've known me in the past are probably wondering why i didn't
0:39:25say much about text-to-speech so far
0:39:28but
0:39:31that technology has really taken a turn
0:39:36in some respects for the better but in many respects for the worse
0:39:40a
0:39:41it sounds a lot more natural than it did
0:39:45in the nineties
0:39:47because of the
0:39:50all right hmm models and other large vocabulary large data
0:39:55synthesis
0:39:57but prosody has
0:39:59pretty much disappeared from text to speech
0:40:04again it may not be important
0:40:07if you're expecting a one sentence response
0:40:11but
0:40:13if you're trying to listen to a whole paragraph i guarantee that you will not have much
0:40:18comprehension
0:40:22the present community of text-to-speech
0:40:26still does quality evaluations but as far as i know
0:40:30they don't do much comprehension evaluation i'm not caught up with the community so
0:40:35i'm not sure but i think it would be good
0:40:38to do
0:40:40an experiment which we actually did years ago which is to present a very large complex
0:40:45paragraph
0:40:46with text to speech
0:40:48and then do college board like
0:40:51multiple
0:40:53choice questions and see how much is retained
0:41:01for these applications
0:41:03error detection and localisation is extremely important
0:41:09i make it
0:41:16and
0:41:21my computer had problems here
0:41:23and we need the dialogue for error recovery
0:41:28also dialogue for help menu is extremely important to facilitate a
0:41:35applications
0:41:37and finally
0:41:39joint optimization between the asr and their application
0:41:43a quite often
0:41:45reduces the error for the application
0:41:48even if it may increase the word error rate for the asr
0:41:53and we have seen that
0:41:55repeatedly in
0:41:56various programs where we went either from
0:42:00transcription of speech or transcription of handwriting
0:42:05to translation joint optimization actually helped
0:42:19we cannot do a
0:42:21for this community for many of the problems that are preventing
0:42:28certain applications
0:42:30from becoming deployable
0:42:32there has to be a lot more work in q and a and information
0:42:36retrieval
0:42:39there has been work on that but i don't believe that the accuracy is such
0:42:44that it would satisfy
0:42:48the kind of customers that
0:42:51would call in to it
0:42:53may have it does have a lot of value in
0:42:58more
0:43:00type of analysis work
0:43:02but
0:43:05we have to have
0:43:07far fewer false alarms and
0:43:10a lot more
0:43:14detection
0:43:15before we can actually do queries
0:43:18and i know that
0:43:21it's my turned back and we well
0:43:25google is the giant in information retrieval
0:43:28and it does have hundred percent recall
0:43:32but it also has zero percent precision
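For reference, the two measures contrasted above are usually defined as follows. The sketch below uses the standard definitions; the document ids and counts are made up purely for illustration and are not results from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # share of results that are relevant
    recall = hits / len(relevant) if relevant else 0.0       # share of relevant items found
    return precision, recall


# Hypothetical query: 1000 documents returned, only 2 of them relevant.
print(precision_recall(range(1000), [3, 7]))  # (0.002, 1.0)
```

Returning nearly everything guarantees that the relevant items are somewhere in the list, which gives full recall, but the share of useful results, the precision, drops towards zero.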
0:43:38and one
0:43:39should not expect
0:43:41to get
0:43:42responses
0:43:45with zero percent precision can we actually for
0:43:50doing we had
0:43:54one aspect of gale which was
0:43:56a
0:43:57what we called distillation which gave very different responses
0:44:03targeted
0:44:05and
0:44:06when danced
0:44:07applying where it was important who they want to one
0:44:13and we had one such example
0:44:16or was one more prevalent
0:44:18for those who
0:44:19to go down the wound up there who
0:44:23the first fifty responses by google were all the reverse
0:44:29while
0:44:30the gale distillation was actually able to pick
0:44:33but
0:44:34i still think that there's a lot more work
0:44:39again there should be a lot more work done in unrestricted bilingual dialogue
0:44:53what don
0:44:56one of the things that
0:44:59prevents this
0:45:00technology from going
0:45:02forward
0:45:03is that there is a need for platforms that learn
0:45:08a lot of the platforms
0:45:10i haven't done as an experiment
0:45:14so for example
0:45:17if you have
0:45:19a
0:45:23dialogue system with a robot or with your desktop
0:45:28whenever it encounters an oov you can explain that word and have it retain that
0:45:34or whenever it encounters a construction
0:45:37that it does not understand
0:45:39it comes back with a clarification dialogue and you can explain it
0:45:43eventually such systems would become smarter
0:45:48and better
0:45:51systems
0:45:52should be eventually configured to be able to do planning and inference
0:45:57and finally
0:46:00although
0:46:01just before i probably not well i started the program
0:46:05and grounded language acquisition for the full a i semantics
0:46:10that did not go very far but i do believe that there's room
0:46:14to do a lot of research in this area
0:46:20with my final slide i'd like to talk a little about the choice of applications
0:46:27so when designing an application they should be
0:46:31customers
0:46:32a standard applications
0:46:36applications with too many false alarms
0:46:39that is
0:46:40a router with too many
0:46:42bad routes
0:46:44a
0:46:45engender lack of trust by the customer
0:46:49the number of misses is not as crucial but it is application dependent
0:46:54because you can always have sort fail for
0:46:58when you miss
0:46:59an action
0:47:02but it is also important
0:47:04to reduce the cost of enrolment and the cost of learning a specific application
0:47:12which is usually done by
0:47:15the machine itself detecting errors and correcting them
0:47:21it is
0:47:23important to design compelling applications some applications may be
0:47:29easy to implement
0:47:31but unless they have an urgent
0:47:33need
0:47:34they most likely will fail
0:47:39it's also always wise to ensure that your application
0:47:44is compelling if selected as an alternative
0:47:49way to accomplish the task
0:47:54if there is an alternative again the application
0:47:58will disappear
0:48:00and
0:48:01finally i'd like to end this on a real positive note which is good news
0:48:09and you've all heard
0:48:12bill gates say that speech is the most natural form of communication
0:48:19and
0:48:20we're actually seeing that
0:48:22speech and multimodality despite
0:48:26the prevalence of smart phones is not disappearing
0:48:30and
0:48:32many of the
0:48:34internet giants are
0:48:37investing
0:48:38heavily
0:48:40in speech technology
0:48:53any questions
0:49:02some
0:49:09someone might mistakenly get the impression from the part where you quoted the comparatively
0:49:16low error rates for the machine on phones
0:49:20that there's nothing to be done on the acoustic part i don't think you think that
0:49:24given your other bullet about noise and reverberation where i think probably the machines
0:49:28fail much faster than people
0:49:33well i
0:49:34as i said there's still that fifteen percent
0:49:38and
0:49:40no more than that i mean because i don't define it should experiment with the
0:49:44noise level finish my higher
0:49:49which buttons so there is still that fifteen percent
0:49:52but also if you noticed one of the humans actually did twice as well
0:49:58as the machine and there's no reason to assume that the machines can't do that
0:50:05well either so yes there is plenty of room for improvement
0:50:09and as a matter of fact there's no reason to assume that the machines can't do better
0:50:13than humans
0:50:15there are many tasks
0:50:17specifically speaker verification where machines are more capable than humans
0:50:25so i'm not saying that a
0:50:28as far as noise is concerned i would love to run the same experiment in
0:50:34noise that was run for clean speech because i think human phonetic recognition
0:50:40in noise will drop way down
0:50:43just like the machine
0:50:45humans use alternative strategies to be able to transcribe speech they don't just use
0:50:52the phone set
0:50:54they have a lot more knowledge which is in the language model of the syntax
0:51:00semantics
0:51:01and
0:51:03yes so there is plenty of room to do research in acoustics
0:51:08but other parts are really lagging we have been stuck with n-gram models
0:51:14okay so we will have done as for translation i don't know about for transcription
0:51:21with seven-gram models the space becomes extremely flat and i always use the same
0:51:29example
0:51:30if i have a bunch of words followed by the that followed by a lot
0:51:35of words followed by the word chew followed by a lot of words provided
0:51:40word then
0:51:41they're chew bone are much more compelling that there were really hairy black
0:51:49while sitting you know outside my challenge
0:51:53so
0:51:55yes i think that although
0:51:58many of my colleagues have assured me that it's been tried i think it
0:52:01should be tried again
0:52:03try to find
0:52:05that are rolling then what we are using
0:52:11more
0:52:14so well most of us here witnessed maybe one r and d cycle or
0:52:20part of an r and d cycle in speech technologies
0:52:24but you probably witnessed a whole bunch of these cycles so is there something that
0:52:28surprised you
0:52:30in the last time something that you basically were not expecting and
0:52:34okay
0:52:42i would say that to
0:52:45in this sense nothing surprised me but
0:52:51i think the technology is continuing on an upward trend in
0:52:58all aspects of the technology the language as well as the transcription
0:53:03the cycles are very long and
0:53:07once in a while you get a breakthrough
0:53:11that points function and the rest of the time they are incremental
0:53:18i don't know whenever i discussed this nobody seems to recall it but a full
0:53:24we gave a
0:53:26an invited talk at either icassp or interspeech i don't remember which one
0:53:31it was but it was in hawaii
0:53:34where he was
0:53:36bemoaning the fact that speech recognition improvements ended in nineteen eighty five and since
0:53:42then all the effort has been in application
0:53:46i don't really buy that observation
0:53:50but progress is very slow
0:53:55we're nowhere
0:53:56near
0:53:57the ability to
0:53:59transcribe and restricted word well all genre in all
0:54:05or be able to understand
0:54:08and you
0:54:10so
0:54:14might don't really basically consisted all
0:54:17doable application