Speech Transcript - Utilization of ASRU technology - present and future

0:00:16	okay first i wanted to thank the committee for inviting me here
0:00:22	i data
0:00:24	when i was writing the speech i had a lot of fun because it took me
0:00:27	back to the old days and
0:00:30	i'll cover a lot of history and i have precedent from now but i think
0:00:36	we can learn a lot from
0:00:38	a
0:00:40	hence ventures steaks and so on
0:00:45	a
0:00:47	there's also
0:00:49	very short cycle in research somewhere between eighteen and twenty years everybody forgets
0:00:55	what was done before so it's always
0:00:58	nice to review
0:01:03	like to
0:01:05	you have someone a tremendous before i start and many people were involved in
0:01:11	well the brunt instead i will describe
0:01:14	but i one
0:01:16	special
0:01:17	acknowledgment for my cold calling generally
0:01:21	whose expertise
0:01:23	knowledge and imagination lead to a lot about this crime
0:01:30	so with all that i will proceed with my talk
0:01:34	the workshop is asru and we're now in the third day of
0:01:41	asru and so far for two days we have had lots of talks about
0:01:46	asr
0:01:48	but the u has been missing
0:01:51	so i will try to somehow fill that
0:01:56	and
0:01:57	asru is a branch of
0:02:00	a larger family of
0:02:03	applications which usually is referred to as natural language processing
0:02:10	and
0:02:11	natural language processing
0:02:14	usually takes a variety of inputs
0:02:19	to most people keyboard or typed input
0:02:24	would seem to be the simplest
0:02:27	it does not require transcription and for
0:02:31	most languages you have things like word boundaries and punctuation although when you're typing you
0:02:37	may not type punctuation but
0:02:39	when you hit return or something like that that's the end of your request
0:02:44	but it has certain problems like homographs
0:02:50	polysemous words and other problems that occur
0:02:54	when you're trying to get a meaning representation from
0:02:58	the input
0:03:01	i wrote here hardcopy but what i meant was online handwritten input
0:03:07	and
0:03:09	it shares a lot of the difficulties with typed input
0:03:14	but it has the added difficulty that
0:03:16	it requires transcription
0:03:19	it's not as bad as when you're dealing with hardcopy because it is online and you
0:03:23	can track
0:03:26	the strokes and consequently
0:03:29	probably get a lot fewer errors but it is still challenging
0:03:33	speech
0:03:35	in a sense shares the same properties that
0:03:38	handwriting input shares
0:03:41	but it has done and also of course and speech shot we do have the
0:03:45	problem of deciding what
0:03:47	things like for one
0:03:50	but speech does have a single feature that is not common to the
0:03:54	first two
0:03:55	which is prosody
0:03:57	which in these particular systems in my opinion is extremely important
0:04:05	when you are trying to transcribe
0:04:07	speech just for transcription's sake
0:04:11	it really doesn't matter
0:04:13	but when you're trying to infer the intended meaning
0:04:17	prosody may or may not play a role
0:04:20	and i will give an example
0:04:23	a
0:04:25	just a simple
0:04:27	question like
0:04:28	is this a book
0:04:30	depending on
0:04:32	whether you stress the word this or book
0:04:35	it still remains somewhat ambiguous but you would assume
0:04:39	that the response
0:04:40	would be no that is a book
0:04:43	but when you stress the word book
0:04:45	the response would be no that is a magazine or whatever
0:04:50	so that
0:04:52	that ambiguity is not resolvable from text alone especially in
0:04:59	a dialogue situation
0:05:02	also going on
0:05:06	i replaced
0:05:07	the meaning representation with applications
0:05:11	which will result either in actions or in responses
0:05:17	and
0:05:19	taking actually there
0:05:21	i've taken the liberty of making
0:05:24	three separate
0:05:27	application classes
0:05:29	these are for my convenience for the talk they're by no means
0:05:34	meant to be
0:05:35	a rule and there is going to be some overlap between
0:05:41	some of these applications
0:05:43	but i will discuss
0:05:45	a these different applications it's like to like talk
0:05:50	but a bit what we can see
0:05:54	and from now on i will stick with asru which means that we are going
0:05:58	to have speech input
0:06:02	and these applications will have to rely on a dialogue system
0:06:08	so in the next slide i will have a chart of an example dialogue
0:06:14	system
0:06:15	so that i can explain it
0:06:18	so basically
0:06:20	since i used to work for the telephone company the input is a telephone but it
0:06:25	could be
0:06:26	just simply a microphone input
0:06:29	and
0:06:31	the next stage is a transcription task
0:06:34	speech to
0:06:36	text and customarily that would be
0:06:40	a large vocabulary continuous speech recognizer
0:06:44	in the next stage we're going to try to extract meaning
0:06:49	and the meaning may be application driven or may be totally unrestricted
0:06:55	the second one is not within reach yet because it requires full semantic interpretation
0:07:04	but we'll talk a lot about the application environment
0:07:08	semantic rules
0:07:10	and when i say rules here i did not mean manually constructed rules necessarily
0:07:17	finally we get to the dialogue manager that has to make a decision
0:07:22	if there is a response needed or an error is detected
0:07:28	a query will be issued
0:07:29	back to the user
0:07:33	and if an action is necessary then that action will be invoked
0:07:39	and
0:07:40	i'd like to spend a couple of minutes talking about the
0:07:46	language analyser portion of this
0:07:49	and
0:07:51	again i will have a few suggestions for this but by no means these should
0:07:55	be thought of as
0:07:57	all encompassing
0:08:03	so
0:08:04	the simplest method is to use keyword or
0:08:09	phrase spotting
0:08:10	this is a mature technology which is very robust to asr errors
0:08:15	it is manually configured
0:08:19	but it is easy to change an application by simply adding
0:08:24	a
0:08:25	content to it
0:08:27	it does require an expert to design
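[Editor's illustration] The keyword or phrase spotting described above can be sketched in a few lines. This is a minimal sketch and not the speaker's system: the phrase table, the action names, and the function `spot` are all hypothetical, and a real spotter would work on recognizer output with confidence scores rather than on clean text.

```python
import re

# Hypothetical, manually configured phrase table: each phrase maps to an action.
# Extending the application means adding an entry here, which is why this
# approach is easy to change but needs an expert to design the phrases.
PHRASES = {
    "collect call": "COLLECT",
    "calling card": "CALLING_CARD",
    "operator": "OPERATOR",
}


def spot(utterance, phrases=PHRASES):
    """Return the actions whose phrase occurs in the text, in order of appearance."""
    # Normalize whitespace and pad with spaces so phrases match on word boundaries.
    text = " " + re.sub(r"\s+", " ", utterance.lower().strip()) + " "
    hits = []
    for phrase, action in phrases.items():
        pos = text.find(" " + phrase + " ")
        if pos >= 0:
            hits.append((pos, action))
    return [action for _, action in sorted(hits)]
```

Words outside the phrase table are simply ignored, which is the source of the robustness to recognition errors in the surrounding speech.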
0:08:31	next is what most people refer to as statistical methods
0:08:36	i don't like that term because
0:08:39	statistical methods also refers to other aspects
0:08:43	like parsing
0:08:45	so i use the concept of machine learning from parallel corpora
0:08:50	and here you have speech on one side and the resulting actions on the other
0:08:55	side and you can map them
0:08:58	pretty much the way a speech translation system does
0:09:02	this of course is fully automatic
0:09:06	but you do need to obtain data
0:09:09	in many applications that data would be very easy to acquire
0:09:14	huh
0:09:15	the main drawback is that if you want to change or add something to your
0:09:20	application
0:09:21	you need to do
0:09:23	additional training
0:09:28	syntactic analysis
0:09:30	would be very good for some applications
0:09:34	it is not as robust as some of the other technologies
0:09:39	but if the analysis
0:09:42	can be trained with the specific genre or topic
0:09:47	then the
0:09:49	analysis can become very robust
0:09:54	and again it is
0:09:59	quite
0:09:59	easy to
0:10:02	change or extend applications
0:10:05	and it is also helpful in conjunction with asr
0:10:09	for error detection and localisation
0:10:14	shallow semantics
0:10:15	just contributes additional information
0:10:18	when necessary for
0:10:20	a
0:10:21	the arguments themselves and
0:10:24	that's predicate argument analysis
0:10:27	very important for
0:10:30	queries which i'll discuss later in my talk
0:10:33	and finally
0:10:35	deep semantics which i will not discuss because it really is not ready for prime
0:10:40	time
0:10:46	so i will start by discussing call center applications
0:10:49	and this is something
0:10:52	that
0:10:54	we did work on in the late nineties
0:10:56	when lucent was very involved in
0:11:01	small business switching units
0:11:06	the business is huge so it's commercially extremely worthwhile
0:11:12	of course it is much larger today than the estimated eighty billion dollars that
0:11:16	was quoted
0:11:19	in the nineties
0:11:21	it does not have to replace a human operator just cutting a human
0:11:25	operator's time
0:11:27	could result in tremendous savings
0:11:32	and
0:11:34	now i turn
0:11:36	to probably the first successfully deployed asru application
0:11:42	which was the at&t operator system
0:11:46	it was a simple application
0:11:49	with natural language input but of course
0:11:53	it was not natural language analysis it was
0:11:57	word or phrase spotting for only five words
0:12:02	and
0:12:03	i would be hard pressed to remember all five words
0:12:07	but it was employed by at&t
0:12:10	which at that time was
0:12:12	largest corporation in the world what six hundred
0:12:17	that's operators
0:12:19	so just cutting a few seconds off each
0:12:23	operator call could save the company approximately three hundred million dollars a year
0:12:31	going back
0:12:33	i have a
0:12:35	list and here for applications for
0:12:39	a call center
0:12:43	call routing and form filling i will discuss in
0:12:47	great detail
0:12:50	unrestricted interactions which would be something like accessing
0:12:54	purely by voice
0:12:55	a complete website of a store or business
0:13:01	is something that
0:13:02	will come up in later discussion
0:13:06	there in effect you
0:13:09	are not limited by the asr capabilities but by the nlp capabilities
0:13:16	so i will not discuss it much except in my conclusion
0:13:21	so if we start with a
0:13:25	call router for a call centre
0:13:28	we actually implemented one at bell labs
0:13:31	around the turn of the century
0:13:34	a data
0:13:36	the whiteboard was very question in
0:13:40	and there was a
0:13:42	matrix routing and confidence scoring
0:13:45	as well as a destination threshold
0:13:50	if everything was met
0:13:52	the call was routed
0:13:56	if either one of those failed
0:13:59	the system had an option of
0:14:02	re-questioning
0:14:05	sending
0:14:07	to an operator probably after
0:14:10	a trial
0:14:11	or requesting the user to
0:14:15	repeat the request
0:14:17	but there was one other branch to this
0:14:20	dialogue system
0:14:22	which was when we encountered multiple destinations
0:14:27	multiple destinations i will explain in the next slide
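[Editor's illustration] The routing decision described above (confidence scoring, a destination threshold, and the fallbacks of re-prompting, operator transfer, or disambiguation) can be sketched as a single decision function. This is a sketch under stated assumptions, not the deployed logic: the threshold values, the retry limit, the score format, and all names (`Decision`, `decide`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str               # "route", "disambiguate", "reprompt", or "operator"
    destinations: tuple = ()  # candidate destinations, if any


def decide(scores, asr_confidence, attempts,
           min_confidence=0.5, dest_threshold=0.6, max_attempts=2):
    """Pick the next dialogue move from routing scores and recognizer confidence.

    scores: mapping of destination name to routing score (assumed in [0, 1]).
    attempts: how many times the caller has already been asked.
    """
    if asr_confidence < min_confidence:
        candidates = []  # the recognition is not trusted, so nothing qualifies
    else:
        candidates = sorted(d for d, s in scores.items() if s >= dest_threshold)

    if len(candidates) == 1:
        return Decision("route", tuple(candidates))
    if len(candidates) > 1:
        # Several destinations pass the threshold: start a disambiguation dialogue.
        return Decision("disambiguate", tuple(candidates))
    if attempts < max_attempts:
        return Decision("reprompt")
    return Decision("operator")
```

For example, `decide({"car_loan": 0.9, "home_loan": 0.1}, 0.8, 0)` routes the call, while two destinations above the threshold trigger disambiguation instead.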
0:14:33	this was evaluated
0:14:35	with a
0:14:37	bank
0:14:38	and an insurance company
0:14:41	with forty
0:14:43	routing destinations
0:14:45	and at that time
0:14:49	despite the fact that the asr was not
0:14:51	at the same level that it is today
0:14:53	we had ninety six percent routing accuracy
0:14:57	which is to say that
0:15:01	the
0:15:02	false alarm rate was only about four percent
0:15:06	eight percent of those calls were routed to an operator but we did not keep statistics
0:15:11	on how many of those
0:15:14	were legitimate routes because the request was
0:15:17	totally out of domain and how many were actual misses
0:15:22	so the disambiguation dialogue
0:15:26	serves two purposes one
0:15:28	the customer may not know
0:15:30	the exact structure of
0:15:32	the routing
0:15:34	tree
0:15:36	and second would be to combine certain classes so that we have better separation and more
0:15:42	successful routing
0:15:44	so if the user says i'm looking for a used car loan
0:15:48	there will be only one branch that would satisfy the criterion
0:15:53	but the user may say either a loan or a truck loan truck
0:15:58	not being one of the
0:16:00	words in the vocabulary
0:16:02	then the machine would get them into loans
0:16:05	and start a dialogue
0:16:07	and would ask
0:16:09	is this
0:16:10	an existing
0:16:12	i'm sorry would ask
0:16:14	and give the user options is this a car home or personal loan
0:16:20	once the user said a car loan it is going into that branch
0:16:24	and now because there are only two options
0:16:27	the system would ask is this an existing loan and the user would say it's a
0:16:32	new loan
0:16:34	and the call would be routed successfully
0:16:40	the underlying technology for this
0:16:42	was
0:16:44	word
0:16:45	or phrase spotting
0:16:46	which was easy to configure
0:16:49	it did require linguistic expertise
0:16:53	and a
0:16:55	what is extremely accurate especially when routing destinations for mine
0:17:01	and it was easy to a
0:17:04	adopts a new
0:17:05	right
0:17:07	the second alternative for this would again be to train from parallel corpora
0:17:13	and
0:17:16	in my opinion it's
0:17:18	a slight overkill
0:17:20	although analysis of the data
0:17:23	would provide the lexicon which could be used for the keyword or phrase spotting
0:17:34	a during the commanding it is up
0:17:38	often
0:17:39	there is often the need
0:17:41	for
0:17:43	verification
0:17:44	or authentication of the user
0:17:47	this is sort of an aside but i wanted to show you
0:17:51	that it is
0:17:53	really easy
0:17:55	to enrol a system for authentication
0:17:58	because customarily you would have the customer calling in many times so you can get their voice
0:18:05	so we start with a caller calling with an account number login or whatever
0:18:11	and if the account does not exist the call would go to an agent
0:18:15	but if the account does exist
0:18:17	then
0:18:18	we look at the user models and if the user is authenticated
0:18:22	then the machine can choose not necessarily but it may choose to add
0:18:28	that information
0:18:30	to the customer data for adaptation
0:18:33	however if the authentication failed we go to formal authentication which would be posing to the
0:18:42	customer challenge
0:18:44	questions and if they were answered correctly that
0:18:48	user would again be authenticated and their speech would be sent
0:18:53	to the data for training so that the next time they would be automatically verified
0:18:59	if it failed again we go to a human operator
0:19:04	so this
0:19:06	is an extremely easy to implement and use paradigm for authentication
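[Editor's illustration] The authentication flow just described can be written as a small decision function. This is a sketch only: the voice-match score, its threshold, and the names `authenticate`, `voice_score`, and `answers_correct` are assumptions made for illustration, and the talk does not say how the voice match was computed.

```python
def authenticate(account, voice_score, answers_correct, accounts,
                 voice_threshold=0.8):
    """Return (outcome, store_speech) for one call.

    outcome is "agent", "authenticated", or "operator".
    store_speech says whether the caller's speech may be added to the
    customer's voice data (adaptation on success, enrolment after fallback).
    """
    if account not in accounts:
        return ("agent", False)          # unknown account goes to an agent
    if voice_score >= voice_threshold:
        return ("authenticated", True)   # voice match; adding the data is optional
    if answers_correct:
        # Voice match failed, but the challenge questions were answered correctly,
        # so the speech is kept for training and the next call can verify by voice.
        return ("authenticated", True)
    return ("operator", False)           # both checks failed
```

The point of the design is that enrolment costs the customer nothing extra: voice data accumulates as a side effect of calls that were verified some other way.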
0:19:15	the next
0:19:17	application
0:19:19	i call the form filling application
0:19:22	and it involves many
0:19:25	types of applications such as travel
0:19:28	reservations
0:19:30	appointments
0:19:32	many simple transactions
0:19:34	and
0:19:35	which could be bank transactions or store transactions
0:19:40	in these types of applications
0:19:43	there are many fields to be filled
0:19:46	in order to be able to execute
0:19:49	the request
0:19:52	i have taken the liberty of
0:19:54	writing out the script
0:19:56	of what
0:19:57	i generally use when i want to find out if my train is running on
0:20:01	time
0:20:03	and
0:20:04	this is alas the state of the art
0:20:08	for form filling applications today
0:20:14	as you can see
0:20:16	it's a
0:20:17	very strenuous process
0:20:21	so the present
0:20:23	day technology is one where you have computer initiated dialogue
0:20:28	it is well designed for confirmation and does a fairly good job of error detection
0:20:34	but it's not really an example of asru
0:20:38	and not really the state of the art in the technology
0:20:41	it's just what is available out there today
0:20:47	by contrast
0:20:50	this has nothing to do with me although it is darpa
0:20:53	darpa did run the program called atis
0:20:57	many years ago
0:20:59	and this was really a state of the art program
0:21:02	using mixed initiative dialogue
0:21:06	being able to fill many of the entries in the form
0:21:10	with a single utterance with good error detection
0:21:14	and clarification dialogue
0:21:17	and
0:21:18	the application that i showed before would be much better if it looked like this
0:21:23	where you can say something like what times the train from new york right in
0:21:28	front of well one
0:21:31	and since you didn't say what the date was the machine would simply know that
0:21:36	was missing in the form
0:21:38	and ask you for that field
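[Editor's illustration] The mixed initiative form filling described above amounts to filling as many fields as one utterance provides and then asking only for what is still missing. A minimal sketch follows; the field names, the prompt wording, and the functions `update` and `next_prompt` are hypothetical, and the step that extracts fields from speech is assumed to exist elsewhere.

```python
# Hypothetical required fields for a train-status query.
REQUIRED = ("origin", "destination", "date")


def update(form, parsed):
    """Merge fields extracted from one utterance into the form (non-empty values only)."""
    merged = dict(form)
    merged.update({k: v for k, v in parsed.items() if v})
    return merged


def next_prompt(form, required=REQUIRED):
    """Return a question for the first missing field, or None when the form is complete."""
    for field in required:
        if not form.get(field):
            return f"what is the {field}"
    return None
```

With a single utterance supplying origin and destination, `next_prompt` asks only for the date, instead of walking the caller through every field in a fixed order.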
0:21:43	again we look into the underlying technology
0:21:48	my opinion is that this is best served with
0:21:53	syntactic analysis shallow semantics
0:21:56	is a possibility but
0:21:59	not necessary for most of these applications
0:22:03	so it would be easy to implement
0:22:05	as long as you have
0:22:07	a fairly robust
0:22:09	analysis of the syntax
0:22:12	and
0:22:14	it also may help
0:22:21	the paradigm of machine learning
0:22:25	would be difficult to generalise to other applications but could be usable with enough training
0:22:32	however word
0:22:34	or phrase spotting would not i think be a satisfactory solution because
0:22:38	you'd have too many keywords in each phrase uttered
0:22:49	okay
0:22:51	i have the signal
0:22:57	i'm going to
0:22:59	change pace now and go to
0:23:03	speech translation applications
0:23:06	before continuing i'd like to play a very short segment of videotape
0:23:12	and
0:23:13	i know that your recognizer at least one culprit the video and many of you
0:23:18	will probably recognized extracting
0:23:29	how i'd like to buy pesetas
0:23:34	but i
0:23:37	note this adorable formally and if you kevin
0:23:41	i mean my
0:23:44	here's my passport
0:23:50	what is the exchange rate between us dollars and pesetas
0:24:02	okay so
0:24:05	this finding out that is
0:24:07	the first
0:24:10	bilingual
0:24:11	dialogue or speech-to-speech translation paradigm
0:24:16	not reliable and i'm not sure whether is here today
0:24:20	as disputed
0:24:23	and the parents that cmu
0:24:25	you know balance we first do this
0:24:30	i'm not sure whether he's right on that because
0:24:34	when this was implemented
0:24:37	there was no asr system that ran in real time on a computer
0:24:41	and this the of course balanced it can start
0:24:45	special hardware consisting of twelve
0:24:49	D S P modules running in parallel to be able to do the asr in more
0:24:55	or less real time there was a slight delay
0:24:58	but
0:25:00	it was an accomplishment in that sense
0:25:05	the system
0:25:07	consisted of a speech recognizer
0:25:12	with a
0:25:15	specific grammar for the application
0:25:19	a bilingual parser
0:25:21	a bilingual translator
0:25:24	not really a translator but it was a bilingual translator
0:25:29	two text-to-speech modules
0:25:31	which
0:25:33	put the speech out
0:25:36	it was probably better to describe system and i can see what's actively involved but
0:25:42	i think it was that
0:25:44	around four hundred words
0:25:47	keywords in each of the systems of course the translation was
0:25:51	quite straightforward since
0:25:53	you knew what the words were
0:26:01	today's bilingual
0:26:04	human dialogue
0:26:06	is quite different
0:26:08	the underlying technology has been replaced by a generalized
0:26:13	today statistical machine translation
0:26:18	okay
0:26:19	present applications
0:26:21	are quite good
0:26:23	for single turn restricted domain applications
0:26:28	they're not as robust but still extremely good for unrestricted dialogue
0:26:35	but the single turn
0:26:37	is not accurate enough for multi turn dialogues i think we're all familiar or maybe
0:26:43	not with the old
0:26:46	the old
0:26:48	telephone game where you say something to your neighbour and it continues along the line
0:26:53	and by the end
0:26:55	it has no resemblance to what the message was originally
0:27:00	and of course this is what will happen
0:27:04	since the two conversants
0:27:06	do not understand each other's language
0:27:09	so there's a clear need for clarification and disambiguation
0:27:15	which would result in human-machine dialogue
0:27:19	before the translation happens
0:27:22	and there's also a need to understand the context
0:27:26	coreference and so on in order to be able to succeed with a multi turn
0:27:33	freeform conversation
0:27:39	moving on to command and control
0:27:43	i will describe
0:27:44	three applications
0:27:47	a
0:27:48	personal agents
0:27:50	computer user interface by voice and robot control
0:27:59	this is another
0:28:02	project
0:28:04	the last project that we did before
0:28:07	we some closed it's doors on bell labs
0:28:10	which was a personal agent
0:28:13	in those days and then it was quite different and it's to the egg rolls
0:28:20	in two thousand and one i don't think
0:28:23	we foresaw the
0:28:25	prevalence of the
0:28:27	smart
0:28:29	what we now call a phone
0:28:31	in those days mobile phones
0:28:34	were strictly used for voice
0:28:36	so this type of application was extremely necessary
0:28:41	so it consisted of a variety of branches we did not get to do too
0:28:45	many of them
0:28:47	but we did manage to
0:28:50	a
0:28:52	do
0:28:53	function for
0:28:54	remote reading and writing of email services
0:29:00	so it was partially implemented at bell labs in two thousand and one
0:29:05	with dialogue capabilities
0:29:11	the advantage of this system was that it could
0:29:14	collect
0:29:16	a lexicon depending on the task
0:29:20	so for example if you're given a day
0:29:23	that you're interested in an email you could collect all the nine
0:29:27	and subjects for that they so the one who pro
0:29:31	so that they to see
0:29:34	and have an email remotely right to down
0:29:38	there was a error detection
0:29:40	and clarification dialogue
0:29:43	but in addition
0:29:44	there was a test task dependent
0:29:48	what men
0:29:49	so this system did not need any startup training
0:29:54	there were quite a few other systems of this nature at that time and they
0:30:00	all fell by the wayside because they required about half an hour of training
0:30:05	and very few customers were willing to spend the time
0:30:09	this is an important
0:30:10	lesson that i will touch on later in my conclusion
0:30:18	we talk about
0:30:20	computer voice interface
0:30:25	it was originally conceived as that a lengthening interface
0:30:30	because
0:30:31	if you wanted to probe your
0:30:33	computer remotely a
0:30:36	there was no other way to do it and as i said that has disappeared due to
0:30:40	the
0:30:43	emergence of smart phones
0:30:47	but it does contribute to ease of use
0:30:51	and especially aids the handicapped
0:30:56	the mouse in this case is and headed
0:30:59	the mentioned for
0:31:01	multimodal use
0:31:04	but of course one could
0:31:06	also use gestures
0:31:09	and eye tracking if your computer is equipped to do that
0:31:15	it does enhance the interactions
0:31:18	so for example if you're working in an excel sheet
0:31:21	a
0:31:22	instead of having to write the formulas you could simply
0:31:26	a
0:31:26	verbalise
0:31:28	without the mouse by saying average column three or
0:31:32	with a mouse simply point
0:31:34	to the column or with your finger
0:31:37	and say average this column
0:31:42	and finally
0:31:44	robotic command and control
0:31:48	okay a
0:31:50	nelson showed us a at all
0:31:53	but time
0:31:55	few weeks ago hours the visiting my granddaughter and she actually has the story and
0:32:01	this is not
0:32:02	a
0:32:03	i think voice response story it's actually
0:32:07	training by the child and does all sorts of things like set and calm and
0:32:12	you can see
0:32:14	my resilience like no
0:32:22	i'm sure many of your seen in the robotic
0:32:25	would be wildly
0:32:27	which was a garbage one thing
0:32:30	the vice robotic device not voice control
0:32:36	this is a device used by the military to
0:32:41	explore spaces
0:32:43	and defuse bombs
0:32:46	generally it's not used with voice control but
0:32:50	activated by joystick
0:32:53	but if
0:32:54	the soldiers do not have time to wait for it to explore the space before they
0:32:59	would enter
0:33:00	voice control would certainly help
0:33:03	and finally
0:33:05	this is a program run by darpa
0:33:09	with the strange name big dog
0:33:12	i don't know why it's called big dog
0:33:15	mule would probably be better because it's meant to carry
0:33:23	a lot of provisions so that the soldier is not too laden with
0:33:29	the weight
0:33:31	and this particular device can certainly use voice control because it is accompanying the
0:33:37	soldier and the soldier
0:33:39	needs to remain hands free and eyes free to be able to operate
0:33:45	so one
0:33:48	it is found in torrance
0:33:50	and extremely useful for both commercial
0:33:53	and military purposes
0:33:57	big dog as i showed before is a companion to a soldier
0:34:02	and it's the perfect setup
0:34:04	for multi modal communication
0:34:07	because when you have your hands and eyes free it certainly is more natural to say
0:34:12	big dog
0:34:14	go there and point to it
0:34:16	or
0:34:17	have it follow your gaze
0:34:20	and
0:34:24	on the other thing that i and added here is
0:34:29	the reporters of multimodal communication
0:34:32	could be found in yours
0:34:34	where
0:34:35	there were about itself
0:34:37	wouldn't use gesture
0:34:38	is direction finder
0:34:44	so
0:34:46	i would like now to address
0:34:50	what i think
0:34:51	is necessary for the future
0:34:55	and
0:34:57	obviously for asr
0:35:00	we still have
0:35:02	a problem
0:35:04	with its robustness to noise
0:35:07	and channel conditions
0:35:10	i believe that is being worked on
0:35:16	but there is
0:35:18	another nagging problem
0:35:21	in language modeling which prevents the technology from being robust
0:35:26	to topic and genre
0:35:29	very often
0:35:31	we train
0:35:33	with lots of data for a specific genre and when we switch to a different genre
0:35:39	a
0:35:40	the accuracy falls very drastically
0:35:45	so i do believe that we need
0:35:48	to spend a lot of effort
0:35:50	researching language models
0:35:54	and i had the luxury a few years ago to
0:35:59	have an experiment done
0:36:02	because i was curious as to
0:36:05	how computer phonetic transcription relates to human phonetic transcription
0:36:13	most people believe that humans are extremely adept at phonetic transcription
0:36:19	and i believe that is because many of the experiments that have been done
0:36:25	in transcribing
0:36:27	phonetics were done in artificial settings and the results are much higher
0:36:32	than
0:36:34	they should be
0:36:36	so
0:36:37	we ran an experiment where we asked humans to transcribe speech naturally
0:36:43	except that they have no
0:36:46	lexical semantic or even phonotactic information
0:36:52	to do that
0:36:53	we chose two languages with an extremely similar phoneme set
0:36:58	had one set of native speakers speak one language and had another set of native
0:37:04	speakers
0:37:06	transcribe that in their own language
0:37:08	as best they could
0:37:11	the experiment was actually carried out with
0:37:14	an additional language i will show you the results for the first two languages which were
0:37:19	japanese
0:37:20	and italian
0:37:23	which have a tremendous overlap in phonemes and as you can see here
0:37:27	the asr had a
0:37:31	thirty four point nine percent phone error rate
0:37:34	the average human had twenty nine point nine
0:37:38	the best human had seventeen point two but the worst
0:37:43	much exceeded the machine
0:37:46	humans have no trouble understanding even thirty seven point five percent
0:37:52	phone error rate
0:37:55	the experiment was also done using
0:37:58	spanish and italian
0:38:00	and of course
0:38:02	there is
0:38:04	quite a bit of phonotactic overlap and some lexical overlap and the results
0:38:10	for
0:38:11	spanish-italian were much higher
0:38:13	but when you're devoid of any kind of language models and phonotactic models
0:38:19	obviously
0:38:21	the machines are doing almost as well there is room here for about
0:38:26	fifteen percent relative improvement
0:38:30	i might add that the recognizer used here was not a neural net
0:38:34	recognizer and we're beginning to see that fifteen percent relative improvement
0:38:41	so
0:38:42	maybe soon
0:38:43	the machines will match the human ability to transcribe
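[Editor's illustration] The phone error rates quoted in this experiment are normally computed as an edit distance between the reference and the transcribed phone sequences, divided by the reference length. The sketch below assumes that standard definition (the talk does not spell out the scoring), and the function names are hypothetical. It also shows the arithmetic behind the relative improvement: going from 34.9 for the machine to 29.9 for the average human is roughly a 14 percent relative reduction, in line with the figure of about fifteen percent.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Phone error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline, target):
    """Percent reduction in error rate needed to go from baseline to target."""
    return 100.0 * (baseline - target) / baseline


# Figures quoted in the talk: machine 34.9, average human 29.9.
gap = relative_improvement(34.9, 29.9)
```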
0:38:51	going on
0:38:53	people always talk about
0:38:55	prosodic analysis in asr
0:38:58	but to date
0:38:59	so far there has been very little research
0:39:03	it's not important
0:39:05	for transcription
0:39:07	or one way translation
0:39:10	but it's extremely important for dialogue because
0:39:14	intent
0:39:15	does drive the dialogue
0:39:21	those of you who've known me in the past are probably wondering why i didn't
0:39:25	say much about text-to-speech so far
0:39:28	but
0:39:31	that technology has really taken a turn
0:39:36	in some respects for the better but in many respects for the worse
0:39:40	a
0:39:41	it sounds a lot more natural than it did
0:39:45	in the nineties
0:39:47	because of the
0:39:50	hmm models and other large vocabulary large data
0:39:55	synthesis
0:39:57	but prosody has
0:39:59	pretty much disappeared from text to speech
0:40:04	again it may not be important
0:40:07	if you're expecting a one sentence response
0:40:11	but
0:40:13	if you're trying to listen to a full paragraph i guarantee that you will not have much
0:40:18	comprehension
0:40:22	the present community of text-to-speech
0:40:26	still does quality evaluations but as far as i know
0:40:30	they don't do much comprehension evaluation i haven't kept up with the community so
0:40:35	i'm not sure but i think it would be good
0:40:38	to do
0:40:40	an experiment which we actually did years ago which is to present a very large complex
0:40:45	paragraph
0:40:46	with text to speech
0:40:48	and then do college board like
0:40:51	multiple
0:40:53	choice questions and see how much is retained
0:41:01	for these applications
0:41:03	error detection and error localisation is extremely important
0:41:09	i make it
0:41:16	and
0:41:21	my computer had problems here
0:41:23	and we need the dialogue for error recovery
0:41:28	also dialogue for help menu is extremely important to facilitate a
0:41:35	applications
0:41:37	and finally
0:41:39	joint optimization between the asr and the application
0:41:43	a quite often
0:41:45	reduces the error for the application
0:41:48	even if it may increase the word error rate for the asr
0:41:53	and we have seen that
0:41:55	repeatedly and
0:41:56	various programs where we're at the either
0:42:00	transcriptions from speech are transcription from and writing
0:42:05	going to speech translation or joint optimization actually all
0:42:19	what can we do
0:42:21	in this community about many of the problems that are preventing
0:42:28	certain applications
0:42:30	from becoming deployable
0:42:32	there has to be a lot more work in Q and A and information
0:42:36	retrieval
0:42:39	there has been work on that but i don't believe that the accuracy is such
0:42:44	that it would satisfy
0:42:48	the kind of customers that
0:42:51	would call into it
0:42:53	it does have a lot of value in
0:42:58	more
0:43:00	types of analysis work
0:43:02	but
0:43:05	we have to have
0:43:07	far fewer false alarms and
0:43:10	a lot more
0:43:14	detection
0:43:15	before we can actually do queries
0:43:18	and i know that
0:43:21	it's my turned back and we well
0:43:25	google is the giant in information retrieval
0:43:28	and it does have hundred percent recall
0:43:32	but it also has zero percent precision
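[Editor's illustration] The contrast between recall and precision drawn here can be made concrete with the standard set-based definitions. This is a generic sketch with made-up numbers, not a measurement of any real search engine; the function name and the toy data are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy case: returning 1000 documents to cover the 2 relevant ones gives
# full recall but a precision of only 2 in 1000.
p, r = precision_recall(range(1000), [3, 7])
```

A system that returns nearly everything reaches full recall trivially, which is why a spoken question answering application, where the caller hears one answer, depends on precision instead.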
0:43:38	and one
0:43:39	should not expect
0:43:41	to get
0:43:42	responses
0:43:45	with zero percent precision can we actually for
0:43:50	doing we had
0:43:54	one aspect of gale was
0:43:56	a
0:43:57	what we called distillation which gave very different responses
0:44:03	targeted
0:44:05	and
0:44:06	when danced
0:44:07	applying where it was important who did what to whom
0:44:13	and we had one such example
0:44:16	or was one more prevalent
0:44:18	for those who
0:44:19	to go down the wound up there who
0:44:23	a the first fifty responses by google were all the reverse
0:44:29	well
0:44:30	the gale distillation was actually able to pick
0:44:33	but
0:44:34	still think that there's a lot more work
0:44:39	again there should be a lot more work done in unrestricted bilingual dialogue
0:44:53	what don
0:44:56	one of the things that
0:44:59	prevent this
0:45:00	technology from going
0:45:02	forward
0:45:03	is that there is a need for platforms that learn
0:45:08	a lot of the platforms
0:45:10	i haven't done as an experiment
0:45:14	so for example
0:45:17	if you have
0:45:19	hey
0:45:23	dialogue system whether on a robot or with your desktop
0:45:28	whenever it encounters an oov if you can explain that word and have it retain that
0:45:34	or whenever it encounters a construction
0:45:37	that it does not understand
0:45:39	comes back with a clarification dialogue and you can explain it
0:45:43	eventually such systems would become smarter
0:45:48	and better
0:45:51	systems
0:45:52	should be eventually configured to be able to do planning and inference
0:45:57	and finally
0:46:00	although
0:46:01	just before i probably not well i started the program
0:46:05	in grounded language acquisition for the full a i semantics
0:46:10	that did not go very far but i do believe that there's room
0:46:14	to do a lot of research in this area
0:46:20	with my final slide i'd like to talk a little about the choice of applications
0:46:27	so when designing an application they should be
0:46:31	customers
0:46:32	a standard applications
0:46:36	applications with too many false alarms
0:46:39	that is
0:46:40	a router with too many
0:46:42	bad routes
0:46:44	a
0:46:45	engender a lack of trust by the customer
0:46:49	the number of misses is not as crucial but it is application dependent
0:46:54	because you can always have sort fail for
0:46:58	when you miss
0:46:59	an action
0:47:02	but it is also important
0:47:04	to reduce the cost of enrolment and the cost of learning a specific application
0:47:12	which is usually done by
0:47:15	the machine itself detecting errors and correcting them
0:47:21	it
0:47:23	is important to design compelling applications some applications may be
0:47:29	easy to implement
0:47:31	but unless they have an urgent
0:47:33	need
0:47:34	they most likely will fail
0:47:39	it's also always wise to ensure that your application
0:47:44	it is compelling selected as an alternative
0:47:49	way to accomplish the task
0:47:54	if there is an alternative again the application
0:47:58	will disappear
0:48:00	and
0:48:01	finally i'd like to end this on a real positive note which is good news
0:48:09	and you're all for a
0:48:12	bill gates has said that speech is the most natural form of communication
0:48:19	and
0:48:20	we're actually seeing that
0:48:22	speech and multi modality despite
0:48:26	the prevalence of smart phones are not disappearing
0:48:30	and
0:48:32	many of the
0:48:34	internet giants are
0:48:37	investing
0:48:38	heavily
0:48:40	in speech technology
0:48:53	any questions
0:49:02	some
0:49:09	someone might mistakenly get the impression from the part where you quoted the comparatively
0:49:16	low error rates for the machine on phones
0:49:20	that there's nothing to be done on the acoustic part i don't think you think that
0:49:24	there is your other bullet about noise and reverberation where i think probably the machines
0:49:28	fail much faster than people
0:49:33	well i
0:49:34	as i said there's still that fifteen percent
0:49:38	and
0:49:40	no more than that i mean because i don't define it should experiment with the
0:49:44	noise level finish my higher
0:49:49	which buttons so there is still that fifteen percent
0:49:52	but also if you noticed the humans actually did twice as well
0:49:58	as the machine and there's no reason to assume that the machines can't do that
0:50:05	well either so yes there is plenty of room for improvement
0:50:09	and as a matter of fact there's no reason to assume that the machines can't do better
0:50:13	than humans
0:50:15	there are many tasks
0:50:17	specifically speaker verification where machines are more capable than humans to do it
0:50:25	so i'm not saying that a
0:50:28	as far as noise is concerned i would love to run the same experiment in
0:50:34	noise that was run for clean speech because i think human phonetic recognition
0:50:40	in noise will drop way down
0:50:43	just like the machine
0:50:45	humans use alternative strategies to be able to transcribe speech they don't just use
0:50:52	the phone set
0:50:54	they have a lot more knowledge which is in the language model of the syntax
0:51:00	semantics
0:51:01	and
0:51:03	yes so there is plenty of room to do research in acoustics
0:51:08	but other parts are really lagging we have been stuck with n-gram models
0:51:14	okay so we will have done as for translation i don't know about for transcription
0:51:21	seven n-gram models space becomes extremely flat and i have always to use the same
0:51:29	example
0:51:30	if i have a bunch of words followed by the that followed by a lot
0:51:35	of words followed by toward shoe followed by a lot of words provided
0:51:40	word then
0:51:41	they're chew bone are much more compelling that there were really hairy black
0:51:49	while sitting you know outside my challenge
0:51:53	so
0:51:55	yes i think that although
0:51:58	many of my colleagues have assured me that it's been tried i think it
0:52:01	should be tried again
0:52:03	try to find
0:52:05	that are rolling then what we are using
0:52:11	more
0:52:14	so well most of us here have witnessed maybe one cycle or
0:52:20	part of an R and D cycle in speech technologies
0:52:24	but you have probably witnessed a whole bunch of these cycles so is there something that
0:52:28	surprised you
0:52:30	in the last time something that you basically were not expecting and
0:52:34	okay
0:52:42	i would say that to
0:52:45	in this sense nothing surprised me but
0:52:51	i think the technology is continuing on an upward trend in
0:52:58	all aspects of the technology the language as well as the transcription
0:53:03	the cycles are very long and are
0:53:07	we want to wanna get a break through the use you
0:53:11	that points function and the rest of the time they are incremental
0:53:18	i don't know whenever i discussed this nobody seems to recall it but a full
0:53:24	we gave a
0:53:26	an invited talk at either icassp or interspeech i don't remember which one
0:53:31	it was but it was in hawaii
0:53:34	where he was
0:53:36	bemoaning the fact that speech recognition improvements ended in nineteen eighty five and since
0:53:42	then all the effort has been in application
0:53:46	i don't really what that observation
0:53:50	but progress is very slow down
0:53:55	we're nowhere
0:53:56	near
0:53:57	the ability to
0:53:59	transcribe unrestricted speech in all genres
0:54:05	or be able to understand
0:54:08	and you
0:54:10	so
0:54:14	might don't really basically consisted all
0:54:17	doable application

Utilization of ASRU technology - present and future

Applications Day

Joseph Olive