Speech Transcript - Supporting Spoken Assistant Systems with a Graphical User Interface that Signals Incremental Understanding and Prediction State

0:00:14	she you not good afternoon
0:00:17	i am casey kennington
0:00:20	currently at boise state university but this is work that i did
0:00:24	while i was at bielefeld university along with david schlangen
0:00:29	and i'm gonna give my two cents on
0:00:31	a continuation i guess on yesterday's discussion on personal assistants
0:00:36	'cause we're gonna tell you a little bit about a personal assistant that we've
0:00:39	been working on
0:00:40	and if you don't know what a personal assistant is you're in the wrong conference
0:00:46	you've heard of them you've used them and they're great i mean they're useful
0:00:51	we dialogue people aren't the only ones using them lay people are using them
0:00:55	quite often quite regularly
0:00:58	but
0:01:00	when these laypeople use these
0:01:03	systems
0:01:03	these dialogue systems essentially these personal assistants they do weird things with them and they
0:01:08	complain about myriad things
0:01:11	and so today i want to talk about a few of those things and maybe make
0:01:15	an approach addressing a couple of them
0:01:18	one thing is that they kind of have a difficulty signalling affordances someone showed a book
0:01:23	yesterday of things you can do with siri
0:01:25	why does it need a book
0:01:28	that needs to be signalled somehow and a lot
0:01:35	of these show speech recognition output and sometimes it's great perfect
0:01:39	but you know well
0:01:41	that speech recognition even if it is perfect does not you know mean it understood
0:01:46	that's something else that needs to happen here
0:01:49	they don't know that it understood until it finally does something comes back and the results
0:01:52	are
0:01:53	maybe what they wanted maybe not
0:01:55	another thing is the user has to express
0:01:58	express their intent in one go
0:01:59	they have to say the whole thing wait for it to get back to them
0:02:02	and then they can continue wanting
0:02:05	sort of like this again with the system
0:02:08	looking into that a little bit more if you if you consider a
0:02:12	personal assistant on a continuum on one extreme you have these
0:02:17	personal assistants that don't even really want to talk to you
0:02:22	they
0:02:23	it's apparently easier to predict your life than it is to predict
0:02:28	what you're trying to say and so google now is trying to do this and this
0:02:31	is useful
0:02:34	on the other side of the continuum you have the full turn
0:02:38	personal assistant that is expecting you to
0:02:40	give an entire intent and then it
0:02:42	does all its understanding and gives you some kind of response maybe there's something
0:02:46	in the middle that would be a little bit nicer
0:02:49	sub-turn a little bit to the left here so
0:02:52	i say call mom and there's some sort of feedback that it understood me
0:02:56	and i know that it understood me it's nice to amend it and then i can
0:02:59	say on speaker phone and okay good
0:03:03	and we can move this maybe even a little bit more to the left
0:03:06	and say something call your mom
0:03:09	on speaker phone
0:03:13	it's
0:03:14	exactly that's what i meant to say
0:03:16	so there's a little bit of prediction it's not trying to predict your entire life it's
0:03:19	allowing you to give at least part of the intent but it's doing some prediction
0:03:22	there so we can maybe make our dialogue systems fit somewhere on this continuum that's useful
0:03:27	for any particular user
0:03:29	we want to look at this a little bit
0:03:32	really quick related work some inspiration joyce chai's work on misalignment manners signalling understanding and
0:03:37	others' work
0:03:40	on backchannels staffan larsson's
0:03:44	work on guis which we kind of are gonna do here and then of course
0:03:48	the luis project
0:03:50	we take inspiration from all of these
0:03:52	for some reason none of these people are here
0:03:59	but we're gonna do something using all this all of these as a sort of
0:04:03	inspiration so we're gonna signal ongoing understanding
0:04:06	using a gui
0:04:07	assuming here of course that people have a way to display a gui so this might
0:04:11	not work on something like the amazon
0:04:13	echo but most people have their phones with them and can use the personal assistant
0:04:18	with the display
0:04:21	and with this really backchannels don't overlap speech so if they're talking and it's
0:04:25	updating and showing them its understanding then it's not gonna have any problems importantly it works
0:04:30	incrementally
0:04:31	that is word for word i'll explain that in a moment a little bit more and
0:04:34	it works with
0:04:36	minimal or no training data
0:04:41	the rest of the talk is as follows i'm gonna explain our system
0:04:44	and the components of it and then
0:04:47	see if that system is worth its salt
0:04:50	well first the system
0:04:53	at first blush looks like any other dialogue system you've ever seen there's speech there's
0:04:57	nlu there's dialogue management there's some way to convey the
0:05:03	response to the user
0:05:04	usually with nlg but in this case a gui
0:05:07	the speech recognition i'm not gonna go into too much it's
0:05:10	google asr we have it modularised here nicely to give us incremental
0:05:15	results so word by word it's coming back to us and we take that incremental
0:05:22	output from the asr give it to our nlu
0:05:25	and our nlu is working in lockstep with that so it takes a word
0:05:30	and we're gonna use the simple incremental update model which we introduced at
0:05:33	sigdial in two thousand thirteen
0:05:36	and without getting technical you can look at the paper if you like
0:05:40	equations and things like that what you get is you give it a word
0:05:45	and it's going to produce a distribution over slots
0:05:48	and that can be given to the dm the dialogue manager is gonna use
0:05:52	that somehow
0:05:54	with this little provision when someone utters a word
0:05:57	asr gives us a word
0:05:59	that is the same as or similar to
0:06:02	a value that could fill a candidate slot
0:06:05	then that's gonna get more credit and this is how we are able to make
0:06:09	the system work with little or no training data and then build up from there
0:06:13	that's nlu
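
The word-by-word update described above can be illustrated with a small sketch. This is not the simple incremental update model from the paper, whose equations are not given in the talk; it is a simplified stand-in that only shows the idea of giving more credit to slot values that are the same as or similar to the incoming word. The slot inventory, the `floor` smoothing constant, and the use of `difflib` string similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical slot inventory (illustrative only, not the data used in the talk).
SLOT_VALUES = {
    "cuisine": ["thai", "italian", "turkish"],
    "contact": ["peter", "anna", "mom"],
}


def similarity(word, value):
    """String similarity in [0, 1]; 1.0 means the word equals the value."""
    return SequenceMatcher(None, word.lower(), value.lower()).ratio()


def uniform_beliefs():
    """Start with a uniform distribution over the values of every slot."""
    return {slot: {v: 1.0 / len(vals) for v in vals}
            for slot, vals in SLOT_VALUES.items()}


def update(beliefs, word, floor=0.05):
    """One incremental step: values similar to the new word get more credit.

    `floor` (an assumed constant) keeps unrelated words from zeroing a value out.
    """
    updated = {}
    for slot, dist in beliefs.items():
        scored = {v: p * (floor + similarity(word, v)) for v, p in dist.items()}
        total = sum(scored.values())
        updated[slot] = {v: s / total for v, s in scored.items()}
    return updated


def interpret(utterance):
    """Feed an utterance one word at a time, as an incremental ASR would."""
    beliefs = uniform_beliefs()
    for word in utterance.split():
        beliefs = update(beliefs, word)
    return beliefs


if __name__ == "__main__":
    final = interpret("i want thai food")
    print(max(final["cuisine"], key=final["cuisine"].get))
```

Because the only knowledge source is the list of candidate values, a sketch like this needs no training data, which matches the property claimed in the talk; a learned model could later replace the similarity function.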
0:06:16	the dialogue manager is taking these
0:06:19	word for word the nlu is giving these slot
0:06:23	distributions to dialogue management dialogue manager has to do something with that
0:06:28	so
0:06:29	in fact it's making one of four
0:06:32	fairly simple decisions one is
0:06:34	i get a slot i look at its confidence value and what do i do i
0:06:38	can wait
0:06:39	if the confidence value is low just sort of ignore it
0:06:43	so a particular value isn't enough to make the slot the one that i
0:06:47	want
0:06:49	or i can select something
0:06:51	if it's above some confidence threshold then the slot is good let's fill it with this
0:06:54	value
0:06:55	or two others here if we're close to that threshold
0:06:58	but not quite there so let's make a clarification request and somehow display that in the gui
0:07:05	and then of course they have to be able to confirm that request
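
The four decisions just listed (wait, select, request clarification, confirm) can be sketched as a small rule. The threshold values and function names below are hypothetical; the talk only says that one threshold triggers selection and that values close to it trigger a clarification request.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    SELECT = "select"
    REQUEST = "request_clarification"
    CONFIRM = "confirm"


# Hypothetical thresholds; the talk does not give the actual values.
SELECT_THRESHOLD = 0.7
CLARIFY_THRESHOLD = 0.5


def decide(confidence, pending_clarification=False, user_confirmed=False):
    """Pick one of the four decisions for the top value of a slot."""
    # An open clarification request that the user answers with "yes" is confirmed.
    if pending_clarification and user_confirmed:
        return Action.CONFIRM
    # Confident enough: fill the slot with this value.
    if confidence >= SELECT_THRESHOLD:
        return Action.SELECT
    # Close to the threshold but not quite there: ask, and show it in the GUI.
    if confidence >= CLARIFY_THRESHOLD:
        return Action.REQUEST
    # Otherwise do nothing yet and keep listening.
    return Action.WAIT


if __name__ == "__main__":
    for c in (0.2, 0.6, 0.9):
        print(c, decide(c).value)
```

Since the decision is re-evaluated after every word, waiting costs nothing: the slot simply stays open until a later word pushes its confidence over one of the thresholds.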
0:07:08	i want to point out here that it is here between sort of the nlu
0:07:12	and the dialogue manager
0:07:14	where the endpointing is done we're not doing endpointing with speech recognition that's
0:07:18	just always on
0:07:20	and it's here where
0:07:22	so they can stop and pause and think and do something it'll wait for
0:07:25	them to finish so they can do things in instalments so it's sort of semantically
0:07:28	driven endpointing
0:07:30	and we use opendial
0:07:32	for this it's sort of rule-based at the moment but the provisions are
0:07:36	there now for
0:07:38	reinforcement learning and learning on-line to improve the system as people interact with it
0:07:43	now we do we
0:07:45	the dialogue manager decides which slots are to be filled and it says gui here's
0:07:50	the decision i've made please convey this information to the user
0:07:53	and the gui you'll notice right off the bat we
0:07:58	obviously aren't ui designers
0:08:01	but here is that you turn the system on and
0:08:04	this comes up it's in javascript so
0:08:07	and it just looks like a right branching tree and really that's all it is
0:08:10	but right here you can already see what the affordances are oh we can
0:08:13	do these five things that's nice
0:08:14	i don't have to guess i don't have to play with it and figure out what
0:08:17	it knows and what it doesn't know
0:08:19	and so i look at this thing and say well you know i am kinda
0:08:22	hungry and it will go then into the food domain and sort of open up
0:08:26	the tree and its slots
0:08:28	if you if you're hungry then it
0:08:30	you know wants to know what you want and where
0:08:34	you're gonna eat it
0:08:35	and i can say you know i'm hungry for some thai food and at
0:08:38	that point it'll
0:08:40	go to the top here and
0:08:43	show this node in red with a question mark for this clarification state did you say thai
0:08:47	and this to me is
0:08:50	intuitive in that it
0:08:52	is trying to understand me and all i have to do is say yes or i
0:08:55	mean thai and that would
0:08:56	basically fill that slot which
0:09:00	conveyed visually means that it just collapses that part of the tree and shows it like this
0:09:03	so here's a frame that is filled
0:09:06	and it's shown visually like this
0:09:11	that's our system
0:09:13	recall right
0:09:16	now well we did some experiments to see if that system was everything we
0:09:21	hoped it would be and we put some people in front of it
0:09:25	though
0:09:27	we want to test a couple of things about this system so we're gonna break
0:09:31	it up into basically four different
0:09:34	different settings
0:09:36	we want to test
0:09:38	we want to see if our incremental system is better than or more useful i
0:09:42	suppose than the traditional one
0:09:46	so we're gonna let them play with it first and give them a trial
0:09:49	phase here's our system here are some tasks do them and get used to the
0:09:52	interface and then we're gonna
0:09:55	sort of start on the very right side of the continuum where they're
0:09:58	doing this
0:09:59	traditional
0:10:02	current turn taking fully intent mentioning
0:10:11	personal assistant
0:10:12	so it endpoints
0:10:14	as usual
0:10:15	kind of like the traditional personal assistant
0:10:17	so then we
0:10:19	then move on the continuum a little bit to the left and
0:10:23	now it's incremental now we're doing sub-turns
0:10:25	and you can
0:10:29	do things in instalments
0:10:31	and then we have phase three where we're moving that
0:10:34	a little bit more to the left on the continuum and
0:10:37	now it's going to adapt to you a little bit and try to predict and
0:10:41	fill some of these slots for you
0:10:44	to expand on that a little bit phase one acted like a standard personal assistant silence
0:10:48	endpointing before the gui would even show and the asr was shown like it
0:10:52	is in your standard personal assistant
0:10:55	phase two is the incremental phase so they did phase one for four minutes
0:10:59	and then they began phase two phase two did not display asr it's just the gui and
0:11:03	it was just always there showing always updating
0:11:07	and the endpointing as i mentioned was done semantically
0:11:11	after phase two ended there was a questionnaire and we just asked them you know
0:11:14	what do you think about
0:11:16	these different systems so there were ten questions and we asked them you know
0:11:20	did they prefer the first system the second system either or both
0:11:24	and phase three started this was the adaptive phase
0:11:29	which is basically the same as phase two with adaptation and the way we did that is a
0:11:33	very simple way
0:11:35	if they
0:11:36	if they did a task
0:11:38	basically filled a slot or frame
0:11:41	and they
0:11:42	did that same thing again it will remember it and start to
0:11:45	ask them just immediately ask a clarification so instead of saying i want this i
0:11:50	want the thai food they would say i'm hungry and then it would show that then
0:11:53	they'd just have to say yes and it would fill the slots for them
0:11:56	and then after three times it's just filling in the frame entirely for them
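
The adaptation rule described above can be sketched as a counter over completed frames. The exact counts at which the system switches behaviour are an assumption: here a frame seen twice is proposed as an immediate clarification, and a frame seen three times is filled in outright. The class and method names are hypothetical.

```python
from collections import Counter


class AdaptiveMemory:
    """Remembers completed frames per domain and predicts repeats."""

    def __init__(self, propose_after=2, autofill_after=3):
        # Both limits are assumed values, chosen to mirror the description in the talk.
        self.propose_after = propose_after
        self.autofill_after = autofill_after
        self.counts = Counter()

    def record(self, domain, frame):
        """Call when the user signals that a filled frame was correct."""
        self.counts[(domain, tuple(sorted(frame.items())))] += 1

    def predict(self, domain):
        """Return (mode, frame) where mode is 'none', 'propose' or 'autofill'."""
        seen = {key: n for key, n in self.counts.items() if key[0] == domain}
        if not seen:
            return ("none", None)
        key = max(seen, key=seen.get)  # most frequent frame in this domain
        frame = dict(key[1])
        if seen[key] >= self.autofill_after:
            return ("autofill", frame)
        if seen[key] >= self.propose_after:
            return ("propose", frame)
        return ("none", None)


if __name__ == "__main__":
    memory = AdaptiveMemory()
    thai = {"cuisine": "thai", "location": "nearby"}
    for _ in range(3):
        print(memory.predict("food")[0])
        memory.record("food", thai)
    print(memory.predict("food")[0])
```

In the 'propose' mode the user only has to say yes to the displayed clarification; in the 'autofill' mode entering the domain is enough.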
0:12:00	and i'll show an example of that in a video in a moment
0:12:03	and then after phase three we had another questionnaire that compared phases two and three
0:12:08	so here's that video
0:12:10	so this is in german i'm doing this
0:12:13	so if you speak german i apologise for my accent and so anyway so
0:12:18	i'm saying something like this i'm hungry i want to eat something around here
0:12:22	maybe thai food
0:12:23	and it does a clarification so i say exactly
0:12:26	and then i repeat this several times to show you the adaptability of this
0:12:30	this isn't something you would do you're not gonna take your personal assistant and repeat
0:12:34	yourself five times
0:12:36	it's gonna give us a lot
0:12:38	but just to show the functionality of this
0:12:45	stress
0:12:49	are
0:12:52	i
0:12:54	so
0:12:56	it's filter not just one more kiss
0:13:00	we are hungry and now it's also
0:13:03	i feel like
0:13:05	and i don't see that same thing i am hungry
0:13:09	so
0:13:16	and then the last time i said calmly
0:13:19	if someone else
0:13:25	i'm pretty
0:13:27	pretty easy to predict yes but this is common
0:13:30	people who use these personal assistants do the same thing
0:13:33	over and over again
0:13:35	my brother here's an anecdote my brother every day twice a day opens up his
0:13:40	iphone and says to siri
0:13:42	google voice you traffic
0:13:44	every day
0:13:45	he says it just like that and it gets the response he wants and people do
0:13:49	this and it could probably just pop up and show him the traffic
0:13:54	where am i here
0:13:55	so we got fourteen participants to come and sit down with our system so we
0:14:00	sat them down at a table there was a
0:14:02	a screen that showed the task that they were to do more on that in a moment
0:14:05	and then there was a tablet that was turned on its side it
0:14:08	showed the gui and the gui was as i showed you and it's
0:14:13	javascript so it was in a web browser basically a motel what
0:14:16	and then there was a keyboard to push a button to let them know that they could move
0:14:19	on
0:14:21	or to signal that the task was complete rather so the tasks were like
0:14:26	this there are five possible tasks call reminder
0:14:29	find a restaurant leave a message or find a route between two cities
0:14:34	and the tasks were shown as icons and the task items were randomly chosen randomly chosen task
0:14:39	randomly chosen slots that we want them to convey to the system and then
0:14:43	there is a fifty percent chance later that the task would be repeated
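
The task selection just described can be sketched as a small sampler. The slot inventories below are made up, and the way the fifty percent repetition is applied (each new trial has a fifty percent chance of repeating an earlier task) is one possible reading of the description, not a confirmed detail of the experiment.

```python
import random

# Hypothetical slot inventories for the five tasks named in the talk.
TASKS = {
    "call": {"contact": ["peter", "anna"]},
    "reminder": {"time": ["morning", "evening"]},
    "restaurant": {"cuisine": ["thai", "italian"], "location": ["nearby", "downtown"]},
    "message": {"recipient": ["peter", "anna"]},
    "route": {"origin": ["bielefeld", "berlin"], "destination": ["hamburg", "munich"]},
}


def sample_task(rng):
    """Randomly choose a task, then a random value for each of its slots."""
    name = rng.choice(sorted(TASKS))
    frame = {slot: rng.choice(values) for slot, values in TASKS[name].items()}
    return name, frame


def task_stream(n, rng, repeat_prob=0.5):
    """Generate n trials; each trial may repeat an earlier task (assumed reading)."""
    history, stream = [], []
    for _ in range(n):
        if history and rng.random() < repeat_prob:
            task = rng.choice(history)
        else:
            task = sample_task(rng)
            history.append(task)
        stream.append(task)
    return stream


if __name__ == "__main__":
    for name, frame in task_stream(5, random.Random(0)):
        print(name, frame)
```

Repeats matter for the adaptive phase, since the system can only predict a frame the participant has already completed.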
0:14:48	here's an example
0:14:49	they'd be sitting down playing with the system and then something
0:14:53	like this would pop up on the screen and that means call
0:14:56	peter
0:14:57	and the system would then
0:15:00	do its magic and then
0:15:03	show its gui and once they
0:15:06	recognised that it understood then they would push a button and a new task would pop up
0:15:12	and they were charged with doing as many of these tasks as possible
0:15:15	the reason we wanted to do this
0:15:19	and not just let them play with it is because the tasks
0:15:22	help us
0:15:25	collect some objective measures as well if we tell them we want them to do
0:15:28	as many tasks as possible in the four minutes they have to interact with
0:15:32	each setting of the system then we can learn a little bit more about how
0:15:35	productive they were
0:15:37	so here's the other tasks they would see stuff like this
0:15:39	so we have the twenty most common german names you know the most populous
0:15:43	cities in germany bielefeld it turns out is among them
0:15:48	and you know everything else part of the so there's quite a few possibilities that
0:15:52	could be said here
0:15:54	but again
0:15:55	we didn't train this at all we just sort of typed these in got a
0:15:58	list of stuff and threw it into the system imported it that was the end
0:16:01	of it and it worked
0:16:05	here are some results from the questionnaires we can conclude
0:16:09	the following based on some significance scores that they generally liked the gui
0:16:15	they found it intuitive to use and easy and understandable
0:16:18	and that was our main focus our main goal
0:16:22	the gui allowed mistakes to be taken care of locally and they did this a lot
0:16:26	if a slot was filled with the wrong thing they
0:16:29	would immediately try to fix it
0:16:31	they didn't always just push a button move on to the next task or
0:16:34	there is a keyword they could say that would restart from the beginning they
0:16:37	generally tried to fix it right there and it was able to do it
0:16:40	most of the time
0:16:42	and they didn't generally notice the difference between phase two and phase three the incremental and
0:16:47	adaptive phase they didn't really know there's
0:16:48	something adapting but for those who did notice which was about half of them they
0:16:52	noticed in phase three when it did get it wrong and there's a listing of all
0:16:56	the questions and there's more in the in the results section of the paper on
0:16:58	this because of the
0:17:00	these are some things we want to highlight from that
0:17:04	so
0:17:05	the objective results these tell an interesting story so we
0:17:10	just counted the number of tasks they were able to do in the different
0:17:14	settings
0:17:15	and once they get to incremental and adaptive they're able to do quite a few more tasks
0:17:19	at least they thought the tasks were complete
0:17:22	and here the next row is frame accuracy so when all the slots in
0:17:26	the frame were the same as the ones that we wanted them to convey in the
0:17:30	task that we showed
0:17:32	and the adaptive one
0:17:33	does quite well because part of the time the slots are already filled
0:17:38	for them
0:17:39	so score one for google now
0:17:41	i guess trying to predict your life is actually maybe easier than learning how to
0:17:45	understand language
0:17:48	the other two tell a more interesting story we get f-score which is
0:17:52	basically maybe the entire frame wasn't correct but this gives an idea of
0:17:58	the correctness of the slots of the frame maybe one or two of the slots were correct
0:18:01	one wasn't
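
The two measures just described, frame accuracy and slot-level f-score, can be sketched as follows. The talk does not say how the f-score was averaged, so micro-averaging over all slots is an assumption here, and the example frames are invented.

```python
def frame_accuracy(pairs):
    """Fraction of trials where every slot of the frame matches the target."""
    return sum(pred == target for pred, target in pairs) / len(pairs)


def slot_f_score(pairs):
    """Micro-averaged F1 over individual slots (averaging scheme is assumed)."""
    tp = fp = fn = 0
    for pred, target in pairs:
        for slot, value in pred.items():
            if slot in target and target[slot] == value:
                tp += 1          # slot filled with the intended value
            else:
                fp += 1          # slot filled with something else
        for slot, value in target.items():
            if pred.get(slot) != value:
                fn += 1          # intended value missing from the prediction
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented (predicted, target) frames for two trials.
    trials = [
        ({"cuisine": "thai", "location": "nearby"},
         {"cuisine": "thai", "location": "nearby"}),
        ({"contact": "peter"}, {"contact": "anna"}),
    ]
    print(frame_accuracy(trials), slot_f_score(trials))
```

Frame accuracy is all or nothing per trial, while the f-score gives partial credit when only some slots of a frame are right, which is why the two rows can tell different stories.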
0:18:03	and
0:18:04	in this case incremental is lower and then look at the time the time is about
0:18:08	the same across all and this tells us that the gui was
0:18:12	intuitive enough that in the
0:18:15	phase where they are just playing with it in the trial phase
0:18:19	they learned enough about it and experienced enough that they aren't just getting used to it
0:18:22	over time
0:18:26	and
0:18:28	both these rows tell kind of that story
0:18:31	so it helps them to be a little bit more productive especially in the adaptive the
0:18:34	adaptive
0:18:36	setting
0:18:37	so they're kinda nice results not the most stellar thing this thing isn't you
0:18:42	know going to be in everyone's phone next month
0:18:46	but
0:18:47	like i said we didn't use any training data and it was fairly robust
0:18:55	some discussion here
0:18:57	our incremental personal assistant or ipa i suppose allows users to fix mistakes
0:19:02	easier and sooner allows the users to interpret the state of the system's understanding
0:19:08	and under the adaptive settings it allows users to be more productive you get more
0:19:12	tasks done in this kind of the setting where we're driving them to do tasks
0:19:16	like this
0:19:17	and it endpoints based on semantics not based on silence
0:19:20	that's a nice thing
0:19:23	future work
0:19:27	i mean learning is the obvious thing we have a system with no training data let's interact
0:19:31	with it and it should start to learn and do things better
0:19:34	and the mechanisms are there sium the nlu model we have the dialogue manager we
0:19:39	have they all have provisions for this we just need some kind of a supervision signal
0:19:42	which we have if the frame's filled and gets sent on they're happy with that
0:19:46	we can give feedback now to say those utterances led to this then that should
0:19:50	that should help the nlu and help the dialogue manager work better
0:19:53	same for adaptive
0:19:55	and better user modelling and adaptability
0:19:58	could be improved
0:20:00	also web based authoring luis does this a lot of other systems do this
0:20:04	right now it's not too bad you can author a json file and it'll import it there's
0:20:08	tools for that and it's actually fairly quick and easy but web based authoring might
0:20:11	be nice and then of course we need to scale up to more
0:20:15	larger domains the gui is the bottleneck here and it's sort of a two edged sword you
0:20:18	wanna show your stuff but also be able to handle lots and lots of general
0:20:23	things so
0:20:25	that is it thank you
0:20:33	note that focus on
0:20:37	if the
0:20:51	right
0:20:58	right like a like i said we're not ui
0:21:02	experts bring us to if you're right it's gives call i guess on but what
0:21:05	we have right now is sort of a max of seven or eight nodes
0:21:09	that is just sort of dot the thing you have to do there is there
0:21:13	and what gets shown what are the top seven that you will show and if
0:21:17	those are if there's something that's not english on there then you're doing something wrong
0:21:21	so there's more user modelling that happens in that regard what gets shown on the
0:21:24	gui
0:21:26	better nlu would help with that
0:21:29	better user modelling would help with that
0:21:31	good question
0:21:33	research future stuff
0:21:35	i q
0:21:47	right that i'm not that the future work i mean the way we don't the
0:21:51	provisions are there also in this you can click on it the
0:21:54	clicking doesn't do anything but the idea is kind of like staffan larsson's
0:21:57	gui is you can talk about the gui itself and navigate it
0:22:01	and say no i don't want any of those go down a little bit restart
0:22:04	right there exactly
0:22:06	exactly so you can flip through it and stuff and you can add something if
0:22:10	it's not there that would be nice too i guess but right and system
0:22:12	in as becomes intent that you can use in the future the gui should be
0:22:15	able to help with that
0:22:18	okay
0:22:38	right so it
0:22:42	right so the question comment was on the semantic endpointing bit of it
0:22:49	something to look at i don't have
0:22:52	don't have an answer
0:22:54	definitely something worth considering
0:23:05	right
0:23:06	agree
0:23:19	no not
0:23:20	i want to be really clear on that they're in the trial phase maybe they've
0:23:24	done all the adapting they're done adapting but the system is so rudimentary and simple
0:23:30	and the gui is that it doesn't it doesn't do much you know there's only
0:23:34	a couple of things that it that it does they learn about it very quickly
0:23:38	that's why the time didn't really change
0:23:40	you know the average time per
0:23:43	per task
0:23:45	so they weren't just
0:23:46	getting used to it over time because they were already used to it before they even
0:23:49	started the first phase that's kind of the takeaway i got from the objective
0:23:53	scores
0:23:55	that's something we were concerned with that's why we designed it this way
0:23:59	i knew someone would ask that question i'm glad somebody did exactly
0:24:03	because of the way we wanted to do the comparisons we wanted to do
0:24:08	subjective comparisons and we wanted to do some objective scores and this was a
0:24:12	debate we had but we ended up doing it this way with the hope that
0:24:14	if we designed it the right way
0:24:16	they'd get used to it right at the beginning we wouldn't have those effects and the numbers
0:24:20	can show that
0:24:22	i'm glad you asked that

Supporting Spoken Assistant Systems with a Graphical User Interface that Signals Incremental Understanding and Prediction State

Oral session 4: Incremental processing

Casey Kennington and David Schlangen