0:00:16 | okay first i wanted to thank the committee for inviting me here |
---|
0:00:22 | i data |
---|
0:00:24 | when i was writing the speech i had a lot of fun because it took |
---|
0:00:27 | me back to the old days and |
---|
0:00:30 | i'll cover a lot of history and i have precedent from now but i think |
---|
0:00:36 | we can learn a lot from |
---|
0:00:38 | a |
---|
0:00:40 | hence ventures steaks and so on |
---|
0:00:45 | a |
---|
0:00:47 | there's also |
---|
0:00:49 | a very short cycle in research somewhere between eighteen to twenty years everybody forgets |
---|
0:00:55 | what was done before so it's always |
---|
0:00:58 | nice to review |
---|
0:01:03 | i'd like to |
---|
0:01:05 | give some acknowledgments before i start many people were involved in |
---|
0:01:11 | the work that i will describe |
---|
0:01:14 | but i want |
---|
0:01:16 | special |
---|
0:01:17 | acknowledgment for my cold calling generally |
---|
0:01:21 | whose expertise |
---|
0:01:23 | knowledge and imagination led to a lot of this work |
---|
0:01:30 | so with all that i will proceed with my talk |
---|
0:01:34 | and the workshop is asru and we're now in the third day of |
---|
0:01:41 | asru and so far for two days we have had lots of talks about |
---|
0:01:46 | asr |
---|
0:01:48 | but the u has been missing |
---|
0:01:51 | so i am trying to somehow fill that |
---|
0:01:56 | and um |
---|
0:01:57 | and so the u is a branch of |
---|
0:02:00 | a larger family of |
---|
0:02:03 | applications which usually is referred to as natural language processing |
---|
0:02:10 | in the |
---|
0:02:11 | natural language processing |
---|
0:02:14 | usually consists of a variety of inputs |
---|
0:02:19 | to most people unicode or typed input |
---|
0:02:24 | would seem to be the simplest |
---|
0:02:27 | it does not require transcription and for |
---|
0:02:31 | most languages you have things like word boundaries and punctuation although when you're typing you |
---|
0:02:37 | may not type punctuation but |
---|
0:02:39 | when you hit return or something that's the end of your request |
---|
0:02:44 | but it has certain problems like homographs |
---|
0:02:50 | polysemous words and other problems that occur |
---|
0:02:54 | when you're trying to get a meaning representation from |
---|
0:02:58 | the input |
---|
0:03:01 | i wrote here hardcopy but what i meant was online handwritten input |
---|
0:03:07 | and a |
---|
0:03:09 | it shares a lot of the difficulties with typed input |
---|
0:03:14 | but it has the added difficulty that |
---|
0:03:16 | it requires transcription |
---|
0:03:19 | it's not as bad as when you're dealing with hardcopy because it's online and you |
---|
0:03:23 | can track |
---|
0:03:26 | the stroke and consequently |
---|
0:03:29 | probably get a lot less errors but it still challenging |
---|
0:03:33 | speech |
---|
0:03:35 | in a sense shares the same difficult properties that |
---|
0:03:38 | handwriting input shares |
---|
0:03:41 | but it has done and also of course and speech shot we do have the |
---|
0:03:45 | problem of deciding what |
---|
0:03:47 | things like for one |
---|
0:03:50 | but speech does have a single feature that is not common to the |
---|
0:03:54 | first two |
---|
0:03:55 | which is prosody |
---|
0:03:57 | which in this particular system in my opinion is extremely important |
---|
0:04:05 | when you're trying to transcribe |
---|
0:04:07 | speech just for transcription's sake |
---|
0:04:11 | it really doesn't matter |
---|
0:04:13 | but when you're trying to infer the intended meaning |
---|
0:04:17 | prosody may or may not play a role |
---|
0:04:20 | and i'll give an example |
---|
0:04:23 | a |
---|
0:04:25 | just a simple |
---|
0:04:27 | question like |
---|
0:04:28 | is this a book |
---|
0:04:30 | depending on |
---|
0:04:32 | whether you stress the word this or book |
---|
0:04:35 | it still remains ambiguous somewhat but you're assuming |
---|
0:04:39 | that the response |
---|
0:04:40 | will be no that is a book |
---|
0:04:43 | but when you stress the word book |
---|
0:04:45 | the response would be no that is a magazine or whatever |
---|
0:04:50 | so that |
---|
0:04:52 | that ambiguity is not resolvable from text alone especially in this situation in |
---|
0:04:59 | a dialogue situation |
---|
0:05:02 | so going on |
---|
0:05:06 | i replaced |
---|
0:05:07 | the meaning representation with applications |
---|
0:05:11 | which will result either in actions or possibly responses |
---|
0:05:17 | and |
---|
0:05:19 | taking actually there |
---|
0:05:21 | i've taken the liberty of making |
---|
0:05:24 | three separate |
---|
0:05:27 | application classes |
---|
0:05:29 | these are for my convenience for the talk they're by no means |
---|
0:05:34 | meant to be |
---|
0:05:35 | a rule and there is going to be some overlap between |
---|
0:05:41 | some of these applications |
---|
0:05:43 | but i will discuss |
---|
0:05:45 | a these different applications it's like to like talk |
---|
0:05:50 | but a bit what we can see |
---|
0:05:54 | and from now on i will stick with asru which means that we're going |
---|
0:05:58 | to have speech input |
---|
0:06:02 | and all of these applications will have to rely on a dialogue system |
---|
0:06:08 | so in the next slide i have a chart of an example of a dialogue |
---|
0:06:14 | system |
---|
0:06:15 | since i can explain so on |
---|
0:06:18 | so basically |
---|
0:06:20 | since i used to work for the telephone company the input is a telephone but it |
---|
0:06:25 | could be |
---|
0:06:26 | just simply a microphone input |
---|
0:06:29 | and |
---|
0:06:31 | the next stage is a transcription task |
---|
0:06:34 | where we get |
---|
0:06:36 | text and customarily that would be |
---|
0:06:40 | a large vocabulary continuous speech recognizer |
---|
0:06:44 | in the next stage we're going to try to extract meaning |
---|
0:06:49 | and the meaning may be application driven or may be totally unrestricted |
---|
0:06:55 | the second one is not within reach yet because it requires full semantic interpretation |
---|
0:07:04 | but we'll talk a lot about the application environment |
---|
0:07:08 | semantic rules |
---|
0:07:10 | and when i say rules here i did not mean manually constructed rules necessarily |
---|
0:07:17 | finally we get to the dialogue manager that has to make a decision |
---|
0:07:22 | if there is a response or an error is detected |
---|
0:07:28 | a query will be issued |
---|
0:07:29 | back to the user |
---|
0:07:33 | and if an action is necessary then an action will be invoked |
---|
0:07:39 | and um |
---|
0:07:40 | i'd like to spend a couple of minutes talking about the |
---|
0:07:46 | language analyser portion of this |
---|
0:07:49 | and |
---|
0:07:51 | again i will have a few suggestions for this but by no means these should |
---|
0:07:55 | be thought of as |
---|
0:07:57 | all encompassing |
---|
0:08:03 | so |
---|
0:08:04 | the simplest method is to use keyword or |
---|
0:08:09 | phrase spotting |
---|
0:08:10 | this is a mature technology which is very robust to asr |
---|
0:08:15 | errors it is manually configured |
---|
0:08:19 | but it is easy to change an application by simply adding |
---|
0:08:24 | a |
---|
0:08:25 | content to it |
---|
0:08:27 | it does require an expert to design |
---|
0:08:31 | next is what most people refer to as statistical methods |
---|
0:08:36 | i don't like that because |
---|
0:08:39 | statistical methods also refers to other aspects |
---|
0:08:43 | like parsing |
---|
0:08:45 | so i use the concept of machine learning from parallel corpora |
---|
0:08:50 | and here you have speech on one side and the resulting actions on the other |
---|
0:08:55 | side and you can map them |
---|
0:08:58 | pretty much the way a speech translation system does |
---|
0:09:02 | this is of course fully automatic |
---|
0:09:06 | but you do need to obtain data |
---|
0:09:09 | in many applications that data would be very easy to acquire |
---|
0:09:14 | huh |
---|
0:09:15 | the main drawback is that if you want to change or add something to your |
---|
0:09:20 | application |
---|
0:09:21 | you need to do |
---|
0:09:23 | additional training |
---|
0:09:28 | syntactic analysis |
---|
0:09:30 | would be very good for some applications |
---|
0:09:34 | it is not as robust as some of the other technologies |
---|
0:09:39 | but if its application |
---|
0:09:42 | can be trained with the specific genre or topic |
---|
0:09:47 | then the |
---|
0:09:49 | analysis can become very robust |
---|
0:09:54 | and again it is |
---|
0:09:59 | quite |
---|
0:09:59 | easy to |
---|
0:10:02 | change or extend applications |
---|
0:10:05 | and it is also helpful in conjunction with asr |
---|
0:10:09 | for error detection and localisation |
---|
0:10:14 | shallow semantics |
---|
0:10:15 | just contributes additional information |
---|
0:10:18 | when necessary for |
---|
0:10:20 | a |
---|
0:10:21 | the arguments themselves and |
---|
0:10:24 | that's predicate argument analysis |
---|
0:10:27 | very important for |
---|
0:10:30 | queries which i'll discuss later in my talk |
---|
0:10:33 | and finally |
---|
0:10:35 | a deep semantics which i will not discuss because it really is not ready for prime |
---|
0:10:40 | time |
---|
0:10:46 | so i will start by discussing call center applications |
---|
0:10:49 | and this is something |
---|
0:10:52 | that |
---|
0:10:54 | we did work on in the late nineties |
---|
0:10:56 | when lucent was very involved in |
---|
0:11:01 | small business switching units |
---|
0:11:06 | the business is huge so it's commercially extremely viable |
---|
0:11:12 | of course it is much larger today than the estimated eighty billion dollars that |
---|
0:11:16 | was quoted |
---|
0:11:19 | in the nineties |
---|
0:11:21 | it does not have to replace a human operator just cutting a human |
---|
0:11:25 | operator's time |
---|
0:11:27 | could result in tremendous savings |
---|
0:11:32 | and um |
---|
0:11:34 | now i turn |
---|
0:11:36 | to probably the first successfully deployed asru application |
---|
0:11:42 | which was the at&t operator system |
---|
0:11:46 | for this simple application |
---|
0:11:49 | with natural language input but of course |
---|
0:11:53 | it was not natural language analysis it used |
---|
0:11:57 | word or phrase spotting for only five words |
---|
0:12:02 | and |
---|
0:12:03 | i'd be hard pressed to remember all five words |
---|
0:12:07 | but it was deployed by at&t |
---|
0:12:10 | which at that time was |
---|
0:12:12 | the largest corporation in the world with six hundred |
---|
0:12:17 | that's operators |
---|
0:12:19 | so just cutting a few seconds off each |
---|
0:12:23 | operator call could save the company approximately three hundred million dollars a year |
---|
0:12:31 | going back |
---|
0:12:33 | i have a |
---|
0:12:35 | list here of applications for |
---|
0:12:39 | a call center |
---|
0:12:43 | call routing and form filling i will discuss in |
---|
0:12:47 | great detail |
---|
0:12:50 | unrestricted interactions which would be something like accessing |
---|
0:12:54 | by voice |
---|
0:12:55 | a complete website of a store or business |
---|
0:13:01 | is something that |
---|
0:13:02 | will come up in later discussion |
---|
0:13:06 | there in effect you |
---|
0:13:09 | are not limited by the asr capabilities but by the nlp capabilities |
---|
0:13:16 | so i will not discuss it much except in my conclusion |
---|
0:13:21 | so if we start with a |
---|
0:13:25 | call router for a call centre |
---|
0:13:28 | we actually implemented one at bell labs |
---|
0:13:31 | around the turn of the century |
---|
0:13:34 | a data |
---|
0:13:36 | the whiteboard was very question in |
---|
0:13:40 | and there was a |
---|
0:13:42 | matrix routing and confidence scoring |
---|
0:13:45 | as well as some destination threshold |
---|
0:13:50 | if everything was met |
---|
0:13:52 | the call was routed |
---|
0:13:56 | if either one of those failed |
---|
0:13:59 | the system had an option of |
---|
0:14:02 | re-querying |
---|
0:14:05 | sending |
---|
0:14:07 | to an operator probably after |
---|
0:14:10 | a trial |
---|
0:14:11 | or requesting the user to |
---|
0:14:15 | rephrase the request |
---|
0:14:17 | but there was one other branch to this |
---|
0:14:20 | dialogue system |
---|
0:14:22 | which was when we encounter multiple destinations |
---|
0:14:27 | multiple destinations i will explain in the next slide |
---|
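
The routing decision described above can be sketched roughly as follows. This is a minimal illustration, not the Bell Labs implementation: it substitutes plain phrase spotting for the matrix routing mentioned in the talk, and the phrase table, destination names, threshold value, and retry limit are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical phrase-to-destination table; a deployed system would use
# an expert-designed lexicon with many more entries.
KEYWORDS = {
    "used car loan": ["auto_loan"],
    "loan": ["auto_loan", "home_loan", "personal_loan"],
    "balance": ["account_balance"],
}

CONFIDENCE_THRESHOLD = 0.6  # assumed value, not taken from the talk
MAX_RETRIES = 1             # assumed: one re-prompt before the operator


@dataclass
class Decision:
    action: str         # "route", "disambiguate", "reprompt" or "operator"
    destinations: list


def spot(utterance: str) -> list:
    """Return the destinations of the longest known phrase in the utterance."""
    text = utterance.lower()
    for phrase in sorted(KEYWORDS, key=len, reverse=True):
        if phrase in text:
            return KEYWORDS[phrase]
    return []


def route(utterance: str, asr_confidence: float, retries: int = 0) -> Decision:
    """Route, disambiguate, re-prompt, or hand over to a human operator."""
    destinations = spot(utterance)
    if asr_confidence < CONFIDENCE_THRESHOLD or not destinations:
        if retries < MAX_RETRIES:
            return Decision("reprompt", [])
        return Decision("operator", [])
    if len(destinations) > 1:
        # Several candidate destinations: start a disambiguation dialogue.
        return Decision("disambiguate", destinations)
    return Decision("route", destinations)
```

A request that matches a single destination with enough confidence is routed directly; one that matches several destinations triggers the disambiguation branch, and a failed or low-confidence request is re-prompted once before going to an operator.
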
0:14:33 | this was evaluated |
---|
0:14:35 | with a |
---|
0:14:37 | bank |
---|
0:14:38 | and an insurance company |
---|
0:14:41 | with forty |
---|
0:14:43 | routing destinations |
---|
0:14:45 | and at that time |
---|
0:14:49 | despite the fact that the asr was not |
---|
0:14:51 | at the same level that it is today |
---|
0:14:53 | we got ninety six percent routing accuracy |
---|
0:14:57 | which is to say that the |
---|
0:15:01 | the |
---|
0:15:02 | false alarm rate was only about four percent |
---|
0:15:06 | eight percent of those calls were routed to an operator but we did not keep statistics |
---|
0:15:11 | on how many of those |
---|
0:15:14 | were legitimate routes because the request was |
---|
0:15:17 | totally out of domain and how many were actual failures |
---|
0:15:22 | so the disambiguation dialogue |
---|
0:15:26 | serves two purposes one |
---|
0:15:28 | the customer may not know |
---|
0:15:30 | the exact structure of |
---|
0:15:32 | probably |
---|
0:15:34 | three |
---|
0:15:36 | and second would be to combine certain classes so that we have better separation and more |
---|
0:15:42 | successful routing |
---|
0:15:44 | so if the user said i'm looking for a used car loan |
---|
0:15:48 | there will be only one branch that would satisfy the criterion |
---|
0:15:53 | but the user may say either loan or truck loan where truck is |
---|
0:15:58 | not one of the |
---|
0:16:00 | words in the vocabulary |
---|
0:16:02 | then the machine would get them into loans |
---|
0:16:05 | and start a dialogue |
---|
0:16:07 | and would ask |
---|
0:16:09 | is this |
---|
0:16:10 | an existing |
---|
0:16:12 | i'm sorry would ask |
---|
0:16:14 | and give the user the options auto home or personal |
---|
0:16:20 | once the user said a car loan it's going to that branch |
---|
0:16:24 | but now because there are only two options |
---|
0:16:27 | the system would ask is this an existing loan and the user says it's a |
---|
0:16:32 | new loan |
---|
0:16:34 | the call would be routed successfully |
---|
0:16:40 | the underlying technology for this |
---|
0:16:42 | was |
---|
0:16:44 | word |
---|
0:16:45 | or phrase spotting |
---|
0:16:46 | which was easy to configure |
---|
0:16:49 | it did require linguistic expertise |
---|
0:16:53 | and a |
---|
0:16:55 | what is extremely accurate especially when routing destinations for mine |
---|
0:17:01 | and it was easy to a |
---|
0:17:04 | adopts a new |
---|
0:17:05 | right |
---|
0:17:07 | the second alternative for this was again to train from parallel corpora |
---|
0:17:13 | and |
---|
0:17:16 | in my opinion it's |
---|
0:17:18 | a slight overkill |
---|
0:17:20 | although analysis of the data |
---|
0:17:23 | would provide the lexicon which could be used for the keyword or phrase spotting |
---|
0:17:34 | a during the commanding it is up |
---|
0:17:38 | often |
---|
0:17:39 | there is often the need |
---|
0:17:41 | for |
---|
0:17:43 | verification |
---|
0:17:44 | or authentication of the user |
---|
0:17:47 | this is sort of an aside but i wanted to show you |
---|
0:17:51 | a |
---|
0:17:53 | really easy |
---|
0:17:55 | system to enrol for authentication |
---|
0:17:58 | because customarily you would have the customer calling in many times so you can get their voice |
---|
0:18:05 | so we start with a caller calling with an account number login or whatever |
---|
0:18:11 | and if the account does not exist it would go to an agent |
---|
0:18:15 | but if the account does exist |
---|
0:18:17 | then |
---|
0:18:18 | we look at the user models and if it's authenticated |
---|
0:18:22 | then the machine can choose not necessarily but it may choose to add |
---|
0:18:28 | that information |
---|
0:18:30 | to the customer data for adaptation |
---|
0:18:33 | however if the authentication failed we go to formal authentication which would be asking the |
---|
0:18:42 | customer challenge |
---|
0:18:44 | questions and if those were answered correctly that |
---|
0:18:48 | user would again be authenticated and their speech would be sent |
---|
0:18:53 | to the data for training so that the next time they would be automatically verified |
---|
0:18:59 | if it failed again we go to a human operator |
---|
0:19:04 | so this |
---|
0:19:06 | is an extremely easy to implement and use paradigm for authentication |
---|
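
The authentication flow just described can be written down as a small decision function. This is only a sketch under stated assumptions: the voice match and the challenge-question result are passed in as booleans rather than computed, and the outcome labels and the `adapt_on_match` flag are hypothetical names introduced for the example.

```python
def authenticate(account_exists: bool,
                 voice_match: bool,
                 challenge_passed: bool,
                 adapt_on_match: bool = True) -> tuple:
    """Return (outcome, enroll_speech) for one incoming call.

    outcome is "agent", "authenticated" or "operator";
    enroll_speech says whether the caller's speech should be added
    to the customer's voice data.
    """
    if not account_exists:
        # Unknown account: hand the call to an agent.
        return ("agent", False)
    if voice_match:
        # Voice accepted; adding the speech for adaptation is optional.
        return ("authenticated", adapt_on_match)
    if challenge_passed:
        # Voice rejected but challenge questions answered correctly:
        # keep the speech so the caller can be verified by voice next time.
        return ("authenticated", True)
    # Both the voice check and the challenge questions failed.
    return ("operator", False)
```

Because every successful challenge adds speech to the customer's data, enrolment happens as a side effect of ordinary calls, which is what makes the paradigm cheap to deploy.
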
0:19:15 | next |
---|
0:19:17 | application |
---|
0:19:19 | i call form filling applications |
---|
0:19:22 | and it involves many |
---|
0:19:25 | types of applications such as travel |
---|
0:19:28 | reservations |
---|
0:19:30 | appointments |
---|
0:19:32 | many simple transactions |
---|
0:19:34 | and |
---|
0:19:35 | which could be bank transactions or store transactions |
---|
0:19:40 | and in these types of applications |
---|
0:19:43 | there are many fields to be filled |
---|
0:19:46 | in order to be able to execute |
---|
0:19:49 | the request |
---|
0:19:52 | i have taken the liberty of |
---|
0:19:54 | writing out the script |
---|
0:19:56 | of what |
---|
0:19:57 | i generally use when i want to find out if my train is running on |
---|
0:20:01 | time |
---|
0:20:03 | and |
---|
0:20:04 | this is alas the state of the art |
---|
0:20:08 | for form filling applications today |
---|
0:20:14 | as you can see |
---|
0:20:16 | it's a |
---|
0:20:17 | very strenuous process |
---|
0:20:21 | so present |
---|
0:20:23 | day technology is the one with computer initiated dialogue |
---|
0:20:28 | it is well designed for confirmation and does a fairly good job of error detection |
---|
0:20:34 | but it's not really an example of asr you |
---|
0:20:38 | and not really the state-of-the-art in the technology |
---|
0:20:41 | it's just what is available out there today |
---|
0:20:47 | by contrast |
---|
0:20:50 | this has nothing to do with me although it is darpa |
---|
0:20:53 | darpa did run a program called atis |
---|
0:20:57 | many years ago |
---|
0:20:59 | and that was really a state-of-the-art program |
---|
0:21:02 | using mixed initiative dialogue |
---|
0:21:06 | being able to fill many of the entries in the form |
---|
0:21:10 | with a single utterance with good error detection |
---|
0:21:14 | and clarification dialogue |
---|
0:21:17 | and |
---|
0:21:18 | now the application that i showed before would be much better if it looked like this |
---|
0:21:23 | where you can say something like what time is the train from new york arriving in |
---|
0:21:28 | front of well one |
---|
0:21:31 | and since you didn't say what the date was the machine would just simply know that |
---|
0:21:36 | was missing in the form |
---|
0:21:38 | and ask you for that |
---|
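
The mixed-initiative idea, filling several fields from one utterance and asking only for what is missing, can be sketched as below. This is an illustrative sketch, not the system from the talk: the field names, the prompt wording, and the destination city are assumptions, and the slots are assumed to come from an upstream language analyser that is not shown.

```python
# Assumed form layout for a train enquiry (hypothetical field names).
REQUIRED_FIELDS = ("origin", "destination", "date")

PROMPTS = {
    "origin": "where are you leaving from",
    "destination": "where are you going",
    "date": "what date are you travelling",
}


def update_form(form: dict, parsed_slots: dict) -> dict:
    """Merge every non-empty slot found in one utterance into the form."""
    merged = dict(form)
    merged.update({k: v for k, v in parsed_slots.items() if v})
    return merged


def next_prompt(form: dict):
    """Return the prompt for the first missing field, or None if complete."""
    for field in REQUIRED_FIELDS:
        if not form.get(field):
            return PROMPTS[field]
    return None


# One utterance fills two fields; "boston" is a made-up destination.
form = update_form({}, {"origin": "new york", "destination": "boston"})
question = next_prompt(form)  # the system asks only for the missing date
```

A computer-initiated dialogue would instead walk through every field in turn, which is what makes the scripted version so strenuous for the caller.
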
0:21:43 | again we look into the underlying technology |
---|
0:21:48 | my opinion is that this is best served with |
---|
0:21:53 | syntactic analysis shallow semantics |
---|
0:21:56 | is a possibility but |
---|
0:21:59 | not necessary for most of these applications |
---|
0:22:03 | so it would be easy to implement |
---|
0:22:05 | as long as you have |
---|
0:22:07 | a fairly robust |
---|
0:22:09 | analysis of the syntax |
---|
0:22:12 | and |
---|
0:22:14 | it also may help |
---|
0:22:21 | the paradigm of machine learning |
---|
0:22:25 | would be difficult to generalise to other applications but could be usable with enough training |
---|
0:22:32 | however word |
---|
0:22:34 | or phrase spotting would not be a satisfactory solution because |
---|
0:22:38 | you'd have too many keywords in each phrase uttered in the |
---|
0:22:49 | okay |
---|
0:22:51 | i have the signal |
---|
0:22:57 | i'm going to |
---|
0:22:59 | change pace now and go to |
---|
0:23:03 | speech translation applications |
---|
0:23:06 | before continuing i'd like to play a very short segment of videotape |
---|
0:23:12 | and |
---|
0:23:13 | i know that you'll recognize at least one culprit in the video and many of you |
---|
0:23:18 | will probably recognized extracting |
---|
0:23:29 | how i'd like to buy pesetas |
---|
0:23:34 | but i |
---|
0:23:37 | note this adorable formally and if you kevin |
---|
0:23:41 | i mean my |
---|
0:23:44 | here's my passport |
---|
0:23:50 | what is the exchange rate between us dollars and pesetas |
---|
0:24:02 | okay so |
---|
0:24:05 | this i found out is |
---|
0:24:07 | the first |
---|
0:24:10 | bilingual |
---|
0:24:11 | dialogue or speech-to-speech translation paradigm |
---|
0:24:16 | alex waibel and i'm not sure whether he is here today |
---|
0:24:20 | has disputed that |
---|
0:24:23 | and claims that cmu |
---|
0:24:25 | you know was first to do this |
---|
0:24:30 | i'm not sure whether he's right on that because |
---|
0:24:34 | when this was implemented |
---|
0:24:37 | there was no asr system that ran in real time on a computer |
---|
0:24:41 | and for this of course bell labs constructed |
---|
0:24:45 | special hardware consisting of twelve |
---|
0:24:49 | dsp modules running in parallel to be able to do the asr in more |
---|
0:24:55 | or less real time there is a slight delay |
---|
0:24:58 | but |
---|
0:25:00 | it was an accomplishment in that sense |
---|
0:25:05 | the system |
---|
0:25:07 | consisted of a speech recognizer |
---|
0:25:12 | with a |
---|
0:25:15 | specific grammar for the application |
---|
0:25:19 | of a lingual parser |
---|
0:25:21 | we only bilingual translator |
---|
0:25:24 | not really a translator but it was bilingual translator |
---|
0:25:29 | two text-to-speech modules |
---|
0:25:31 | which |
---|
0:25:33 | gave speech output |
---|
0:25:36 | it was probably better to describe system and i can see what's actively involved but |
---|
0:25:42 | i think it was that |
---|
0:25:44 | around four hundred words |
---|
0:25:47 | keywords in each of the systems of course the translation was |
---|
0:25:51 | quite straightforward since |
---|
0:25:53 | you knew what the words were |
---|
0:26:01 | today's bilingual |
---|
0:26:04 | human dialogue |
---|
0:26:06 | is quite different |
---|
0:26:08 | the underlying technology has been replaced by generalized |
---|
0:26:13 | statistical machine translation |
---|
0:26:18 | okay |
---|
0:26:19 | present applications |
---|
0:26:21 | are quite good |
---|
0:26:23 | for single turn restricted domain applications |
---|
0:26:28 | they're not as robust but still extremely good for unrestricted dialogue |
---|
0:26:35 | but the single turn |
---|
0:26:37 | is not accurate enough for multi turn dialogues i think we're all familiar or maybe |
---|
0:26:43 | not with the old |
---|
0:26:46 | with the old |
---|
0:26:48 | telephone game where you say something to your neighbour and it continues along the line |
---|
0:26:53 | and at the end |
---|
0:26:55 | it has no resemblance to what the message was originally |
---|
0:27:00 | and of course this is what will happen |
---|
0:27:04 | since the two conversants |
---|
0:27:06 | do not understand each other's language |
---|
0:27:09 | so there's a clear need for clarification and disambiguation |
---|
0:27:15 | which would result in human-machine dialogue |
---|
0:27:19 | before the translation happens |
---|
0:27:22 | and there's also a need to understand the context |
---|
0:27:26 | coreference and so on in order to be able to succeed with a multi turn |
---|
0:27:33 | freeform conversation |
---|
0:27:39 | moving on to command and control |
---|
0:27:43 | i will describe |
---|
0:27:44 | three applications |
---|
0:27:47 | a |
---|
0:27:48 | personal agents |
---|
0:27:50 | computer user interface by voice and robot control |
---|
0:27:59 | this is another |
---|
0:28:02 | project |
---|
0:28:04 | the last project that we did before |
---|
0:28:07 | lucent closed its doors on bell labs |
---|
0:28:10 | which was a personal agent |
---|
0:28:13 | in those days the need was quite different than it is today |
---|
0:28:20 | in two thousand and one i don't think |
---|
0:28:23 | we foresaw the |
---|
0:28:25 | prevalence of |
---|
0:28:27 | smart |
---|
0:28:29 | what we now call phones |
---|
0:28:31 | in those days mobile phones |
---|
0:28:34 | strictly were used for voice |
---|
0:28:36 | so this type of application was extremely necessary |
---|
0:28:41 | so it consisted of a variety of branches we did not get to do too |
---|
0:28:45 | many of them |
---|
0:28:47 | but we did manage to |
---|
0:28:50 | a |
---|
0:28:52 | do |
---|
0:28:53 | function for |
---|
0:28:54 | remote reading and writing of email services |
---|
0:29:00 | so it was partially implemented at bell labs and two thousand and one |
---|
0:29:05 | with dialogue capabilities |
---|
0:29:11 | the advantage of this system was that it could |
---|
0:29:14 | collect |
---|
0:29:16 | a lexicon depending on the task |
---|
0:29:20 | so for example if you gave it a date |
---|
0:29:23 | that you're interested in for email it could collect all the names |
---|
0:29:27 | and subjects for that they so the one who pro |
---|
0:29:31 | so that they to see |
---|
0:29:34 | and have an email remotely right to down |
---|
0:29:38 | there was error detection |
---|
0:29:40 | and clarification dialogue |
---|
0:29:43 | but in addition |
---|
0:29:44 | there was task dependent |
---|
0:29:48 | what men |
---|
0:29:49 | so this system did not need any startup training |
---|
0:29:54 | there were quite a few other systems of this nature at that time and they |
---|
0:30:00 | all fell by the wayside because they required about a half hour of training |
---|
0:30:05 | and very few customers were willing to spend the time |
---|
0:30:09 | this is an important |
---|
0:30:10 | lesson that i will touch on later in my conclusion |
---|
0:30:18 | we talk about |
---|
0:30:20 | computer voice interface |
---|
0:30:25 | it was originally conceived as that a lengthening interface |
---|
0:30:30 | because |
---|
0:30:31 | if you wanted to probe your |
---|
0:30:33 | computer remotely a |
---|
0:30:36 | there was no other way to do it and as i said that has disappeared due to |
---|
0:30:40 | the |
---|
0:30:43 | emergence of smart phones |
---|
0:30:47 | but it does contribute to ease of use |
---|
0:30:51 | and especially aids the handicapped |
---|
0:30:56 | the mouse in this case is an added |
---|
0:30:59 | dimension for |
---|
0:31:01 | multimodal use |
---|
0:31:04 | but of course one could |
---|
0:31:06 | also use gestures |
---|
0:31:09 | and eye tracking if your computer is equipped to do that |
---|
0:31:15 | it does enhance the interactions |
---|
0:31:18 | so for example if you're working on an excel sheet |
---|
0:31:21 | a |
---|
0:31:22 | instead of having to write the formulas you could simply |
---|
0:31:26 | a |
---|
0:31:26 | verbalise |
---|
0:31:28 | without the mouse by saying average column three or |
---|
0:31:32 | with a mouse simply point |
---|
0:31:34 | to the column or with your finger |
---|
0:31:37 | and say average this column |
---|
0:31:42 | and finally |
---|
0:31:44 | robotic command and control |
---|
0:31:48 | okay a |
---|
0:31:50 | nelson showed us a at all |
---|
0:31:53 | but time |
---|
0:31:55 | a few weeks ago i was visiting my granddaughter and she actually has this toy and |
---|
0:32:01 | this is not |
---|
0:32:02 | a |
---|
0:32:03 | i think voice response toy it's actually |
---|
0:32:07 | trained by the child and does all sorts of things like sit and come and |
---|
0:32:12 | you can see |
---|
0:32:14 | my resilience like no |
---|
0:32:22 | i'm sure many of you have seen the robot in the |
---|
0:32:25 | movie wall-e |
---|
0:32:27 | which was a garbage collecting |
---|
0:32:30 | device a robotic device not voice controlled |
---|
0:32:36 | this is a device used by the military to |
---|
0:32:41 | explore spaces |
---|
0:32:43 | and defuse bombs |
---|
0:32:46 | generally it's not used with voice control but |
---|
0:32:50 | activated by joystick |
---|
0:32:53 | but if |
---|
0:32:54 | the soldiers do not have time to wait for it to explore the space before they |
---|
0:32:59 | would enter |
---|
0:33:00 | voice control would certainly help |
---|
0:33:03 | and finally |
---|
0:33:05 | this is a program run by darpa |
---|
0:33:09 | with the strange name of big dog |
---|
0:33:12 | i don't know why it's called big dog |
---|
0:33:15 | mule would probably be better because it's meant to carry a |
---|
0:33:23 | a lot of provisions so that the soldier is not too laden with |
---|
0:33:29 | the weight |
---|
0:33:31 | and this particular device can certainly use voice control because it is accompanying the |
---|
0:33:37 | soldier and the soldier |
---|
0:33:39 | needs to remain hands-free and eyes-free to be able to operate |
---|
0:33:45 | so one |
---|
0:33:48 | it is found in torrance |
---|
0:33:50 | and extremely useful for both commercial |
---|
0:33:53 | and military purposes |
---|
0:33:57 | big dog as i showed before is a companion to a soldier |
---|
0:34:02 | and it's the perfect setup |
---|
0:34:04 | for multi modal communication |
---|
0:34:07 | because when you have your hands and eyes free it certainly is much more natural to tell |
---|
0:34:12 | big dog |
---|
0:34:14 | go there and point to it |
---|
0:34:16 | or |
---|
0:34:17 | have it follow your gaze |
---|
0:34:20 | and |
---|
0:34:24 | the other thing that i added here is |
---|
0:34:29 | the reverse of multimodal communication |
---|
0:34:32 | could be found in yours |
---|
0:34:34 | where |
---|
0:34:35 | the robot itself |
---|
0:34:37 | would use gesture |
---|
0:34:38 | as direction finder |
---|
0:34:44 | so |
---|
0:34:46 | i would like now to address |
---|
0:34:50 | what i think |
---|
0:34:51 | is necessary for the future |
---|
0:34:55 | and |
---|
0:34:57 | obviously for asr |
---|
0:35:00 | we still have |
---|
0:35:02 | a problem |
---|
0:35:04 | with robustness to noise |
---|
0:35:07 | channel conditions |
---|
0:35:10 | i believe that is being worked on |
---|
0:35:16 | but there is |
---|
0:35:18 | an overarching problem |
---|
0:35:21 | in language modeling which prevents the technology from being robust |
---|
0:35:26 | for topic and genre |
---|
0:35:29 | very often |
---|
0:35:31 | we train |
---|
0:35:33 | with lots of data for a specific genre and when we switch to a different genre |
---|
0:35:39 | a |
---|
0:35:40 | the accuracy falls very drastically |
---|
0:35:45 | so i do believe that we need |
---|
0:35:48 | to spend a lot of effort |
---|
0:35:50 | researching language models |
---|
0:35:54 | and i had the luxury a few years ago to |
---|
0:35:59 | have an experiment done |
---|
0:36:02 | because i was curious as to |
---|
0:36:05 | how computer phonetic transcription relates to human phonetic transcription |
---|
0:36:13 | most people believe that humans are extremely adept at phonetic transcription |
---|
0:36:19 | and i believe that is because many of the experiments that have been done |
---|
0:36:25 | in transcribing |
---|
0:36:27 | phonetics are done in artificial settings and the results are much higher |
---|
0:36:32 | than they |
---|
0:36:34 | should be |
---|
0:36:36 | so |
---|
0:36:37 | we ran an experiment where we asked humans to transcribe speech naturally |
---|
0:36:43 | except that they had no |
---|
0:36:46 | lexical semantic or even phonotactic information |
---|
0:36:52 | to do that |
---|
0:36:53 | we chose two languages with an extremely similar phoneme set |
---|
0:36:58 | had one set of native speakers speak one language and had another set of native |
---|
0:37:04 | speakers |
---|
0:37:06 | transcribe that in their own language |
---|
0:37:08 | as best they could |
---|
0:37:11 | the experiment was actually carried out with |
---|
0:37:14 | an additional language i will show you the results for the first two languages which were |
---|
0:37:19 | japanese |
---|
0:37:20 | and italian |
---|
0:37:23 | which have a tremendous overlap in phonemes and as you can see here |
---|
0:37:27 | the asr had a |
---|
0:37:31 | thirty four point nine phone error rate |
---|
0:37:34 | the average human had twenty nine point nine |
---|
0:37:38 | the best human had seventeen point two but the worst |
---|
0:37:43 | much exceeded the machine |
---|
0:37:46 | humans have no trouble understanding even thirty seven point five percent |
---|
0:37:52 | phone error rate |
---|
0:37:55 | experiment was also done by using |
---|
0:37:58 | spanish and italian |
---|
0:38:00 | and of course |
---|
0:38:02 | there is |
---|
0:38:04 | quite a bit of phonotactic overlap and some lexical overlap and the results |
---|
0:38:10 | for |
---|
0:38:11 | spanish-italian were much higher |
---|
0:38:13 | but when you're devoid of any kind of language models and phonotactic models |
---|
0:38:19 | obviously |
---|
0:38:21 | the machines are doing almost as well there is room here for about |
---|
0:38:26 | fifteen percent relative improvement |
---|
0:38:30 | i might add that the recognizer used here was not a neural net |
---|
0:38:34 | recognizer and we're beginning to see that fifteen percent relative improvement |
---|
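The relative figures quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming the phone error rates are taken as the talk states them (34.9 for the machine, 29.9 for the average human, 17.2 for the best human); the function name `relative_improvement` is a hypothetical helper, not something from the talk.

```python
def relative_improvement(baseline_error: float, improved_error: float) -> float:
    """Fraction of the baseline error that the improved system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - improved_error) / baseline_error


# Phone error rates (percent) as quoted in the talk.
machine_per = 34.9
avg_human_per = 29.9
best_human_per = 17.2

# Gap between the machine and the average human transcriber.
avg_gap = relative_improvement(machine_per, avg_human_per)
# Gap between the machine and the best human transcriber.
best_gap = relative_improvement(machine_per, best_human_per)

print(f"machine vs average human: {avg_gap:.1%} relative")
print(f"machine vs best human:    {best_gap:.1%} relative")
```

Under these numbers the gap to the average human comes out near fifteen percent relative, and the gap to the best human is roughly half of the machine's error.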
0:38:41 | so |
---|
0:38:42 | maybe someday |
---|
0:38:43 | the machines will match the human ability to transcribe |
---|
0:38:51 | going on |
---|
0:38:53 | people always talk about |
---|
0:38:55 | prosodic analysis in asr |
---|
0:38:58 | but to date |
---|
0:38:59 | so far there has been very little research |
---|
0:39:03 | it's not important |
---|
0:39:05 | a for transcription |
---|
0:39:07 | or one way translation |
---|
0:39:10 | but it's extremely important for dialogue because |
---|
0:39:14 | intent |
---|
0:39:15 | does drive the dialogue |
---|
0:39:21 | those of you who've known me in the past are probably wondering why i didn't |
---|
0:39:25 | say much about text-to-speech so far |
---|
0:39:28 | but to |
---|
0:39:31 | that technology has really taken a turn |
---|
0:39:36 | in some respects for the better but in many respects for the worse |
---|
0:39:40 | a |
---|
0:39:41 | it sounds a lot more natural than it did |
---|
0:39:45 | in the nineties |
---|
0:39:47 | because of the |
---|
0:39:50 | all right hmm models and other large vocabulary large data |
---|
0:39:55 | synthesis |
---|
0:39:57 | but prosody has |
---|
0:39:59 | fairly much disappeared from text to speech |
---|
0:40:04 | again it may not be important |
---|
0:40:07 | if you're expecting a one sentence response |
---|
0:40:11 | but |
---|
0:40:13 | if you're trying to listen to a whole paragraph i guarantee that you will not have much |
---|
0:40:18 | comprehension |
---|
0:40:22 | the present community of text-to-speech |
---|
0:40:26 | still does quality evaluations but as far as i know |
---|
0:40:30 | they don't do much comprehension evaluation i haven't kept up with the community so |
---|
0:40:35 | i'm not sure but i think it would be good |
---|
0:40:38 | to do |
---|
0:40:40 | an experiment which we actually did years ago which is to present a very large complex |
---|
0:40:45 | paragraph |
---|
0:40:46 | with text to speech |
---|
0:40:48 | and then do college or like |
---|
0:40:51 | multiple |
---|
0:40:53 | choice questions and see how much is retained |
---|
0:41:01 | for these applications |
---|
0:41:03 | error detection and what localisation is extremely important |
---|
0:41:09 | i make it |
---|
0:41:16 | and |
---|
0:41:21 | my computer had problems here |
---|
0:41:23 | and we need the dialogue for error recovery |
---|
0:41:28 | also dialogue for help menu is extremely important to facilitate a |
---|
0:41:35 | applications |
---|
0:41:37 | and finally |
---|
0:41:39 | joint optimization between the asr and the application |
---|
0:41:43 | a quite often |
---|
0:41:45 | reduces the error for the application |
---|
0:41:48 | even if it may increase the word error rate for the asr |
---|
0:41:53 | and we have seen that |
---|
0:41:55 | repeatedly in |
---|
0:41:56 | various programs where we had either |
---|
0:42:00 | transcription from speech or transcription from handwriting |
---|
0:42:05 | going to speech translation or joint optimization actually all |
---|
0:42:19 | we cannot do a |
---|
0:42:21 | for this community for many of the problems that are preventing |
---|
0:42:28 | certain applications |
---|
0:42:30 | to become deployable |
---|
0:42:32 | there has to be a lot more work in q and a and in information |
---|
0:42:36 | retrieval |
---|
0:42:39 | there has been work on that but i don't believe that the accuracy is such |
---|
0:42:44 | that would satisfy |
---|
0:42:48 | the kind of customers that |
---|
0:42:51 | would call into it |
---|
0:42:53 | may have it does have a lot of value in |
---|
0:42:58 | more |
---|
0:43:00 | type of analysis work |
---|
0:43:02 | but |
---|
0:43:05 | we have to have |
---|
0:43:07 | very much less false alarm and |
---|
0:43:10 | a lot more |
---|
0:43:14 | detection |
---|
0:43:15 | before we can actually do qualities |
---|
0:43:18 | and i know that |
---|
0:43:21 | it's my turned back and we well |
---|
0:43:25 | google is the giant in information retrieval |
---|
0:43:28 | and it does have a hundred percent recall |
---|
0:43:32 | but it also has zero percent precision |
---|
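The contrast between recall and precision drawn here can be made concrete with the standard set-based definitions. This is only a sketch with made-up document identifiers; `precision_recall` is a hypothetical helper and the collection is invented for illustration, not data from the talk.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of retrieved documents."""
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 100,000 documents, only three of them relevant.
collection = {f"doc{i}" for i in range(100_000)}
relevant = {"doc7", "doc42", "doc99"}

# A system that returns everything finds every relevant document ...
p_all, r_all = precision_recall(collection, relevant)
# ... while a targeted system returns a short list with mostly relevant items.
p_sel, r_sel = precision_recall({"doc7", "doc42", "doc500"}, relevant)

print(f"return everything: precision={p_all:.5f} recall={r_all:.2f}")
print(f"targeted response: precision={p_sel:.2f} recall={r_sel:.2f}")
```

Returning the whole collection gives full recall with precision close to zero, which is the trade-off the speaker is pointing at when asking for targeted responses.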
0:43:38 | and one |
---|
0:43:39 | should not expect |
---|
0:43:41 | to get |
---|
0:43:42 | responses |
---|
0:43:45 | with zero percent precision can we actually for |
---|
0:43:50 | doing we had |
---|
0:43:54 | one aspect of gale was |
---|
0:43:56 | a |
---|
0:43:57 | what we call distillation which was a very different responses |
---|
0:44:03 | targeted |
---|
0:44:05 | and |
---|
0:44:06 | when danced |
---|
0:44:07 | applying where it was important who they want to one |
---|
0:44:13 | and we had one such example |
---|
0:44:16 | or was one more prevalent |
---|
0:44:18 | for those who |
---|
0:44:19 | to go down the wound up there who |
---|
0:44:23 | a the first fifty responses by google were all the reverse |
---|
0:44:29 | well |
---|
0:44:30 | the gale distillation was actually able to pick |
---|
0:44:33 | but |
---|
0:44:34 | still think that there's a lot more work |
---|
0:44:39 | again there should be a lot more work done in unrestricted bilingual dialogue |
---|
0:44:53 | what don |
---|
0:44:56 | one of the things that |
---|
0:44:59 | prevents this |
---|
0:45:00 | technology from going |
---|
0:45:02 | forward |
---|
0:45:03 | is that there is a need for platforms that learn |
---|
0:45:08 | a lot of the platforms |
---|
0:45:10 | i haven't done as an experiment |
---|
0:45:14 | so for example |
---|
0:45:17 | if you have |
---|
0:45:19 | hey |
---|
0:45:23 | dialogue system whatsoever about or with your desktop |
---|
0:45:28 | whenever it encounters an oov if you can explain that word and have it retain that |
---|
0:45:34 | or whenever it encounters a construction |
---|
0:45:37 | that it does not understand |
---|
0:45:39 | it comes back with a clarification dialogue and you can explain it |
---|
0:45:43 | eventually such systems would become smarter |
---|
0:45:48 | and better |
---|
0:45:51 | systems |
---|
0:45:52 | should be eventually configured to be able to do planning and inference |
---|
0:45:57 | and finally |
---|
0:46:00 | although |
---|
0:46:01 | just before i probably not well i started the program |
---|
0:46:05 | in grounded language acquisition for the full a i semantics |
---|
0:46:10 | that did not go very far but i do believe that there's room |
---|
0:46:14 | to do a lot of research in this area |
---|
0:46:20 | with my final slide i'd like to talk a little about the choice of applications |
---|
0:46:27 | so when designing applications they should be |
---|
0:46:31 | customer |
---|
0:46:32 | centered applications |
---|
0:46:36 | applications with too many false alarms |
---|
0:46:39 | that is |
---|
0:46:40 | a router with too many |
---|
0:46:42 | bad routes |
---|
0:46:44 | a |
---|
0:46:45 | engenders lack of trust by the customer |
---|
0:46:49 | the number of misses is not as crucial but it is application dependent |
---|
0:46:54 | because you can always have sort fail for |
---|
0:46:58 | when you miss |
---|
0:46:59 | an action |
---|
0:47:02 | but it is also important |
---|
0:47:04 | to reduce the cost of enrolment and the cost of learning a specific application |
---|
0:47:12 | which is usually done by |
---|
0:47:15 | the machine itself detecting errors and correcting them |
---|
0:47:21 | it is |
---|
0:47:23 | is important to design compelling applications some applications may be |
---|
0:47:29 | easy to implement |
---|
0:47:31 | but unless they have an urgent |
---|
0:47:33 | need |
---|
0:47:34 | they most likely will fail |
---|
0:47:39 | it's also always wise to ensure that your application |
---|
0:47:44 | it is compelling selected as an alternative |
---|
0:47:49 | way to accomplish the task |
---|
0:47:54 | if there is an alternative again the application |
---|
0:47:58 | will disappear |
---|
0:48:00 | and |
---|
0:48:01 | finally i'd like to end this on a real positive note which is good news |
---|
0:48:09 | and you're all for a |
---|
0:48:12 | bill gates this and that speech is the most natural form of communication |
---|
0:48:19 | and |
---|
0:48:20 | we're actually seeing that |
---|
0:48:22 | speech and multimodality despite |
---|
0:48:26 | the prevalence of smart phones is not disappearing |
---|
0:48:30 | and |
---|
0:48:32 | many of the |
---|
0:48:34 | internet giants are |
---|
0:48:37 | investing |
---|
0:48:38 | heavily |
---|
0:48:40 | in speech technology |
---|
0:48:53 | any questions |
---|
0:49:02 | some |
---|
0:49:09 | someone might mistakenly get the impression from the part where you quoted the comparatively |
---|
0:49:16 | low error rates for the machine on phones |
---|
0:49:20 | that there's nothing to be done on the acoustic part i don't think you think that |
---|
0:49:24 | given your other bullet about noise and reverberation where i think probably the machines |
---|
0:49:28 | fail much faster than people |
---|
0:49:33 | well i |
---|
0:49:34 | as i said there's still that fifteen percent |
---|
0:49:38 | and |
---|
0:49:40 | no more than that i mean because i don't define it should experiment with the |
---|
0:49:44 | noise level finish my higher |
---|
0:49:49 | which buttons so there is still that fifteen percent |
---|
0:49:52 | but also if you noticed one of the humans actually did twice as well |
---|
0:49:58 | as the machine and there's no reason to assume that the machines can't do that |
---|
0:50:05 | well either so yes there is plenty of room for improvement |
---|
0:50:09 | and as a matter of fact there's no reason to assume that the machines can't do better |
---|
0:50:13 | than human |
---|
0:50:15 | there are many tasks |
---|
0:50:17 | a specifically a speaker verification where machines are more capable than humans to do it |
---|
0:50:25 | so i'm not saying that a |
---|
0:50:28 | as far as noise is concerned i would love to run the same experiment in |
---|
0:50:34 | noise that was run for clean speech because i think human phonetic recognition |
---|
0:50:40 | in noise will drop way down |
---|
0:50:43 | just like the machine |
---|
0:50:45 | humans use alternative strategies to be able to transcribe speech they don't just use |
---|
0:50:52 | the phone set |
---|
0:50:54 | they have a lot more knowledge which is in the language model of the syntax |
---|
0:51:00 | semantics |
---|
0:51:01 | and |
---|
0:51:03 | yes so there is plenty of room to do research in acoustics |
---|
0:51:08 | but other parts are really lagging we have been stuck with n-gram models |
---|
0:51:14 | okay so we will have done as for translation i don't know about for transcription |
---|
0:51:21 | seven n-gram models space becomes extremely flat and i have always to use the same |
---|
0:51:29 | example |
---|
0:51:30 | if i have a bunch of words followed by the that followed by a lot |
---|
0:51:35 | of words followed by toward shoe followed by a lot of words provided |
---|
0:51:40 | word then |
---|
0:51:41 | they're chew bone are much more compelling that there were really hairy black |
---|
0:51:49 | while sitting you know outside my challenge |
---|
0:51:53 | so |
---|
0:51:55 | yes i think that although |
---|
0:51:58 | many of my colleagues have assured me that it's been tried i think it |
---|
0:52:01 | should be tried again |
---|
0:52:03 | try to find |
---|
0:52:05 | that are rolling then what we are using |
---|
0:52:11 | more |
---|
0:52:14 | so well most of us here witnessed maybe one r and d cycle or |
---|
0:52:20 | part of an r and d cycle in speech technologies |
---|
0:52:24 | but you probably witnessed a whole bunch of these cycles so is there something that |
---|
0:52:28 | surprised you |
---|
0:52:30 | in the last time something that you basically were not expecting and |
---|
0:52:34 | okay |
---|
0:52:42 | i would say that to |
---|
0:52:45 | in this sense nothing surprised me but |
---|
0:52:51 | i think the technology is continuing on an upward trend in |
---|
0:52:58 | all aspects of the technology the language as well as the transcription |
---|
0:53:03 | the cycles are very long and are |
---|
0:53:07 | we want to wanna get a break through the use you |
---|
0:53:11 | that points function and the rest of the time they are incremental |
---|
0:53:18 | i don't know whenever i discussed this nobody seems to recall it but a full |
---|
0:53:24 | we gave a |
---|
0:53:26 | an invited talk at either icassp or interspeech i don't remember which one |
---|
0:53:31 | it was but it was in hawaii |
---|
0:53:34 | where he was |
---|
0:53:36 | bemoaning the fact that speech recognition improvements ended in nineteen eighty five and since |
---|
0:53:42 | then all the effort has been in application |
---|
0:53:46 | i don't really what that observation |
---|
0:53:50 | but progress is very slow down |
---|
0:53:55 | we're nowhere |
---|
0:53:56 | near |
---|
0:53:57 | the ability to |
---|
0:53:59 | transcribe unrestricted word well all genre in all |
---|
0:54:05 | or be able to understand |
---|
0:54:08 | and you |
---|
0:54:10 | so |
---|
0:54:14 | might don't really basically consisted all |
---|
0:54:17 | doable application |
---|