Speech Transcript - Influence of Time and Risk on Response Acceptability in a Simple Spoken Dialogue System

0:00:21	so our next speaker will be ingrid zukerman
0:00:25	and she'll be talking about the influence of time and risk on response
0:00:30	acceptability in a simple spoken dialogue system
0:01:05	okay so this is worse than we'd and e
0:01:09	and the you know why am
0:01:14	and now it
0:01:15	well
0:01:16	that doesn't want to
0:01:17	cool
0:01:18	that works
0:01:20	okay so
0:01:21	what are we doing here
0:01:24	evaluations of dialogue systems are often based on ratings
0:01:31	however
0:01:33	if you look at research in recommender systems you will see that people's ratings are
0:01:39	inconsistent over time and that leads to what is called the magic barrier you can
0:01:44	only get to a certain point in accuracy due to people's inconsistencies
0:01:49	so
0:01:50	we ask ourselves
0:01:52	is this true for dialogue systems
0:01:56	and of course this has implications about the reliability of the evaluations of systems and
0:02:02	about comparative evaluations between systems
0:02:06	and
0:02:07	while we were at it we also wanted to check the effect of situational
0:02:11	risk on how people view the responses of
0:02:16	a dialogue system
0:02:20	so
0:02:21	we did an experiment we conducted a longitudinal study that is over time
0:02:28	and in the context of a spoken dialogue system for a household robot
0:02:33	and
0:02:34	the corpus that we use
0:02:36	was a corpus
0:02:39	of spoken requests
0:02:42	that task a robot to fetch or move objects in a room
0:02:46	and this study was in two stages
0:02:50	one of the reviewers of the paper call this heroic thank you
0:02:54	and in the first stage
0:02:57	people selected how they would respond to requests
0:03:01	and after
0:03:02	a year and a half to two years
0:03:05	we gave people their own responses and other responses and asked them to rate those
0:03:11	responses
0:03:15	so the questions that we want to answer
0:03:21	how well do the participants like their stage one response types and we call them response
0:03:28	type rather than dialogue acts
0:03:31	because one of the response types could be just do what you are
0:03:37	that's not a dialogue act
0:03:40	two do the users prefer their stage one response types to other response types
0:03:47	and three again the situational risk because well
0:03:52	it was something we were interested in
0:03:56	so the first thing let's describe the corpus
0:04:00	the corpus was created in the past when we were developing our system we
0:04:06	had thirty five participants that described twelve objects
0:04:10	in different images we had a total of
0:04:13	four hundred and seventy eight descriptions because people were allowed repetitions
0:04:19	asr performance this is google now
0:04:23	a bit worse than what
0:04:25	you would think
0:04:28	so word error rate thirteen percent the top ranked interpretation was wrong in about
0:04:33	half the cases and all interpretations were wrong in about a third of the cases
0:04:39	some of the wrong things were little things like a or and
0:04:43	and that was thirteen percent of the cases
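
A minimal sketch of how a word error rate such as the thirteen percent quoted above is commonly computed: the word-level edit distance between what was said and what the asr returned, divided by the number of words said. The function name and the example hypothesis are assumptions for illustration, not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical asr output with one substituted word out of eight.
spoken = "the plate in the middle of the table"
heard = "the plate in a middle of the table"
print(word_error_rate(spoken, heard))
```
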
0:04:47	we retained
0:04:48	two hundred and ninety two descriptions why so few
0:04:55	for some of them there was inconsistency in rating like some people rated only stage one
0:05:00	others rated only stage two so we couldn't keep them others our system
0:05:04	couldn't process
0:05:07	and others had more than one prepositional phrase and
0:05:13	we can process those but i will
0:05:16	explain later why we got rid of them
0:05:19	so each of those two hundred and ninety two descriptions had
0:05:25	four
0:05:26	top four asr outputs
0:05:30	and okay let's go back a second
0:05:33	why top four we wanted the participant
0:05:38	to hear what the
0:05:41	spoken language understanding system is hearing which is the output of the asr
0:05:49	and then we took our descriptions as i said that were generated in the context
0:05:54	of another study
0:05:56	and
0:05:57	prepended get or move to each asr output to turn them into requests
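
A small sketch of the step just described, assuming the asr outputs are plain strings. The function name is hypothetical, and the choice between get and move is left to the caller, since the talk only says that move was used for objects too big to fetch.

```python
def to_requests(asr_outputs, verb="get"):
    """Turn object descriptions into requests by prepending a verb."""
    if verb not in ("get", "move"):
        raise ValueError("verb must be 'get' or 'move'")
    return [f"{verb} {text}" for text in asr_outputs]


print(to_requests(["the plate in the middle of the table"]))
```
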
0:06:05	then this corpus was divided into sets of at most twelve requests
0:06:10	one per object
0:06:11	so those of you who have heard me before will have seen these pictures
0:06:17	so participants were asked to designate one of the objects a b or c like
0:06:23	eventually all three of them but one at a time
0:06:26	so in this case the participant is describing the hard disk under the table
0:06:34	this is what the asr heard
0:06:36	none of them is correct this is true asr output
0:06:40	and then we put the get in front
0:06:43	so get
0:06:44	that thing
0:06:46	in the second image again the participant wants the ball farther
0:06:52	away from the plate
0:06:55	which object they have
0:06:57	this is what the asr heard
0:07:00	and again we add
0:07:02	the get
0:07:03	and this time one of the interpretations
0:07:06	is
0:07:07	correct yes the first one
0:07:10	this is our third image
0:07:12	the plate in the middle of the table
0:07:15	so we play the same game can speed up now
0:07:18	and our final image the cleanable crack yes that's what they said
0:07:24	and
0:07:25	again
0:07:27	this is what the asr heard
0:07:29	and this time we
0:07:30	do move why because it's a big object
0:07:33	we cannot ask anybody to get the bookcase
0:07:37	okay
0:07:38	now
0:07:39	this is how we collected our corpus and now we start with
0:07:43	the trial stage one
0:07:45	we collected demographic information gender english nativeness whether they are native english speakers
0:07:53	age education
0:07:55	and we also collected risk propensity information because we are interested in the
0:08:01	effect of risk
0:08:04	so we collected these from work firearm and that s
0:08:08	six
0:08:09	risk proneness
0:08:10	statements such as i follow the motto nothing ventured nothing gained
0:08:15	and six
0:08:16	risk aversion statements such as my decisions are always made carefully and accurately
0:08:22	there are six of each and we measure the agreement on a one to five
0:08:26	likert scale
0:08:29	so
0:08:30	these are our demographic characteristics of
0:08:34	in stage one we had forty participants six of those were not reachable in
0:08:39	stage two so we had thirty four people
0:08:41	seventeen female seventeen male eighteen native english speakers sixteen non-native
0:08:47	and these are the age and education
0:08:51	brought five
0:08:56	and for risk proneness just to give you an idea about the human condition
0:09:02	we subtract the
0:09:03	risk aversion from the risk proneness so the sum of
0:09:07	all their scores
0:09:09	and this is what our population looks like they seem to be more
0:09:13	risk prone than
0:09:15	risk averse
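
The score described here can be sketched as the sum of the six risk proneness ratings minus the sum of the six risk aversion ratings, each on the one to five likert scale mentioned earlier. The function name and the example ratings are assumptions made up for illustration.

```python
def risk_propensity(proneness_ratings, aversion_ratings):
    """Sum of risk proneness ratings minus sum of risk aversion ratings.

    Each argument holds six likert ratings from 1 to 5, so the result
    ranges from -24 (fully risk averse) to 24 (fully risk prone).
    """
    for ratings in (proneness_ratings, aversion_ratings):
        if len(ratings) != 6 or any(r not in range(1, 6) for r in ratings):
            raise ValueError("expected six ratings between 1 and 5")
    return sum(proneness_ratings) - sum(aversion_ratings)


# Hypothetical participant who leans risk prone.
print(risk_propensity([4, 4, 3, 5, 4, 3], [2, 3, 2, 3, 2, 2]))
```

A positive score marks a participant as more risk prone than risk averse, which is the pattern the speaker reports for most of the population.
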
0:09:19	so now
0:09:20	now we get to the real stage one
0:09:23	so as i said each participant was shown
0:09:26	the top four asr outputs for each request maximum twelve requests one
0:09:32	per item in each image
0:09:36	and they were shown versions of the images where all the objects are numbered
0:09:41	why because they could
0:09:43	pick any object to talk to
0:09:45	to respond
0:09:47	we had two risk conditions low and high we told them that in the
0:09:51	low risk condition the respondent is in the same room as the requester
0:09:57	in the high risk condition the respondent is far away and it will
0:10:01	incur a lot of inconvenience if they do the wrong thing
0:10:07	and
0:10:08	they had four response types to
0:10:11	choose from and the
0:10:13	they got explanations of what
0:10:14	each response meant
0:10:16	in fact they only got this side
0:10:19	this side is for us
0:10:21	so
0:10:23	do means you would just fetch object number
0:10:28	and put the number of the object you would fetch
0:10:31	confirm is
0:10:33	you want to ask did you mean object again object number
0:10:37	choose is which object did you mean
0:10:39	given a list of objects and rephrase is i can't hear you
0:10:46	i want you to restate so they had four response types to choose from
0:10:51	so this is a sample item so now we see the same room we saw
0:10:54	before
0:10:55	but all the objects are numbered
0:10:59	and this is what the survey looks like so
0:11:03	you have the four outputs assuming that you are in the same room
0:11:07	as the speaker
0:11:09	select one of the responses
0:11:12	get object number did you mean object which object did you mean and for rephrase
0:11:18	we actually gave them the option
0:11:21	to say rephrase the object rephrase the position or rephrase the whole sentence
0:11:29	now we distinguish because the asr makes most of the errors on the object not
0:11:34	on the location
0:11:36	and then
0:11:37	we said assume that
0:11:39	the speaker is in a remote location would you change your answer
0:11:43	and we asked the same question
0:11:48	so
0:11:49	after stage one we got three corpora
0:11:53	one
0:11:55	so we had
0:11:55	five hundred and eighty four responses so
0:11:59	two hundred and ninety two requests times two risk conditions
0:12:02	and
0:12:04	it will become clear why we have three corpora so the first one is
0:12:07	response corpus
0:12:11	response corpus is what answers we got from our participants
0:12:15	we see
0:12:18	okay what answers we got from our participants
0:12:22	and this is the distribution of the answers under the low and the high
0:12:27	risk conditions
0:12:28	so do is clearly the majority class
0:12:31	and we have confirm choose rephrase and as you can see
0:12:37	the
0:12:39	there are
0:12:39	fewer dos and more confirms and chooses
0:12:43	and rephrases under the high risk condition
0:12:47	in addition we developed
0:12:50	two corpora
0:12:51	author corpus and classifier corpus so what is the author corpus
0:12:56	an author
0:12:58	responded to every request
0:13:01	why did we want the author corpus because there is a lot of
0:13:04	variability between people and we wanted to see how user variability affects
0:13:11	the result
0:13:13	and the final corpus is called classifier corpus
0:13:18	and what we did is we trained a classifier
0:13:23	to
0:13:23	select responses based on the
0:13:29	based both on author corpus and on response corpus
0:13:36	oh and i promised i would explain it sorry
0:13:39	so this is why we threw out the
0:13:43	requests with more than one prepositional phrase because we wanted to restrict the features that
0:13:48	we used for training the classifier because we just wanted a simple classifier
0:13:54	okay so
0:13:56	what does our response classifier look like it assumes that
0:14:01	we have a spoken language understanding system that returns ranked interpretations
0:14:07	we have three types of classification features the asr confidence in the correctness of
0:14:13	its own outputs
0:14:15	how well an interpretation matches the description
0:14:20	the risk of the situation and for response corpus we also have the demographic
0:14:24	and risk propensity information
0:14:28	so
0:14:29	a quick example this is a close up of one of the rooms
0:14:34	the description is the brown stool near the table
0:14:37	so
0:14:39	these two stools match well the description
0:14:42	the one
0:14:47	the one over there is a bit closer but they both
0:14:51	are a pretty good match
0:14:57	what about the classifiers so how did the classifier do
0:15:01	we tested a whole bunch of classifiers and random forest won
0:15:06	now
0:15:09	the main thing to note is
0:15:13	the bottom line of course we're doing better on the author corpus than on
0:15:19	the corpus of all the people
0:15:20	why because there was a lot of variability in responses under the exact same
0:15:25	conditions
0:15:29	but this is just
0:15:31	before you think i'm wasting your time
0:15:34	and this is not important for the purposes of this paper
0:15:40	so now
0:15:41	we proceed to experiment two
0:15:44	a year and a half to two years later
0:15:47	so
0:15:48	each participant is shown
0:15:51	the same asr outputs and the same images as in stage one
0:15:56	two risk conditions again
0:15:59	and
0:16:00	a bunch of candidate responses
0:16:03	sourced from
0:16:06	their response type in response corpus which was their own response
0:16:12	the author's response
0:16:14	the response picked by the classifier
0:16:16	and also
0:16:18	do confirm pairs so whenever one of these responses was a do if there was
0:16:24	no confirm in the above three we added the confirm
0:16:28	similarly
0:16:30	if one of these was a confirm and there was no do
0:16:33	we added the do
0:16:34	of course we didn't repeat
0:16:37	if several of these chose the same response we presented it only once
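
The assembly of the stage two candidate list can be sketched as follows. This is a simplification that works on response type names only (it ignores which object a do or confirm refers to), and the function name is hypothetical.

```python
def candidate_responses(sourced):
    """Build the list of responses shown to a participant for rating.

    `sourced` holds the response types gathered from the different
    sources. Repeats are shown once, and do and confirm are always
    presented as a pair: if only one of them is present, the other
    is added.
    """
    candidates = []
    for response in sourced:
        if response not in candidates:
            candidates.append(response)
    if "do" in candidates and "confirm" not in candidates:
        candidates.append("confirm")
    elif "confirm" in candidates and "do" not in candidates:
        candidates.append("do")
    return candidates


print(candidate_responses(["do", "do", "choose"]))
```
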
0:16:44	now we had some
0:16:46	small challenges do and rephrase are direct renditions of the selections in stage one
0:16:54	but for confirm and choose
0:16:57	we needed to do some instantiation
0:17:00	so for confirm we chose the pictorial query value and two point d so we
0:17:06	would say is this what you want
0:17:09	and this is your confirmation the particular plate
0:17:13	for choose we had two options there are two plates on the table
0:17:18	and then
0:17:19	we presented
0:17:21	what was
0:17:22	which one do you want or do you want this or that
0:17:26	now
0:17:27	the pictorial version was restricted to only two or three options
0:17:32	if there were more options in the list
0:17:34	i mean nobody says this or this or this or that
0:17:39	it's usually t c
0:17:40	i
0:17:43	and this is what the survey looks like again we have the same image
0:17:48	we have the output
0:17:50	and
0:17:51	now they get to choose between all these responses
0:17:55	and they get to rate them on
0:17:58	a likert scale be on u w t
0:18:05	again
0:18:08	okay going back to our questions so how did we do
0:18:14	participants ratings of the stage one responses are significantly lower
0:18:20	than the ratings ascribed to these response types under both risk conditions what
0:18:25	do i mean by ascribed
0:18:27	if you recall in stage one
0:18:30	they had to pick a response how would you respond
0:18:33	so we said okay
0:18:36	in order to account for rater bias
0:18:39	we will say okay the one they picked is their highest opinion of them
0:18:43	so if their highest opinion of anything was a five
0:18:48	we ascribed to the response a five if it was a four we ascribed a
0:18:52	four
0:18:53	but the rating was significantly lower well
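
One way to read the ascribed rating is sketched below, under the assumption that it equals the highest rating the participant gave to any candidate in stage two. The function names and the example ratings are hypothetical.

```python
def ascribed_rating(stage_two_ratings):
    """Rating ascribed to the stage one pick: the participant's own
    highest stage two rating, which accounts for rater bias."""
    return max(stage_two_ratings.values())


def rating_drop(stage_one_type, stage_two_ratings):
    """How far the stage two rating of the stage one pick falls below
    the ascribed rating (0 means the participant stuck with it)."""
    return ascribed_rating(stage_two_ratings) - stage_two_ratings[stage_one_type]


# Hypothetical participant who picked do in stage one.
ratings = {"do": 2, "confirm": 5, "choose": 3}
print(ascribed_rating(ratings), rating_drop("do", ratings))
```
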
0:18:58	and
0:19:00	this histogram presents the difference in the rating between
0:19:07	their ascribed responses and their stage two ratings
0:19:12	so for a lot of them
0:19:15	they kept
0:19:16	so whatever we ascribed they also thought it was pretty good
0:19:20	but
0:19:22	for quite a lot of them like to
0:19:25	hundred and thirty three for low risk and hundred and sixty nine for high risk
0:19:31	they significantly reduced the rating
0:19:38	question two
0:19:41	do participants prefer their stage one response type to other response types
0:19:46	in the paper we have both the author and the classifier
0:19:50	here i'm only showing the classifier why the classifier the version of the classifier that
0:19:56	we are using is the one trained on the author corpus it was not even trained on the
0:20:00	users
0:20:01	so what did we do we took
0:20:04	we took all the responses that
0:20:07	are
0:20:08	different
0:20:10	between stage two and stage one and then checked
0:20:13	the ratings
0:20:15	so
0:20:15	only different response
0:20:18	so in a lot of cases
0:20:21	stage one was better than the classifier
0:20:24	in quite a few cases they were the same and
0:20:28	in enough cases
0:20:32	the classifier that is trained on somebody else did better than their own pretty of
0:20:37	yourself
0:20:40	so this is an example
0:20:44	what to get
0:20:45	and in stage one
0:20:48	the user
0:20:49	picked choose
0:20:51	but then in stage two they gave choose a rating of one and confirm
0:20:55	a rating of five
0:21:00	but having said that
0:21:03	at the end of the day
0:21:05	participants rating of their stage one response types
0:21:09	is not statistically significantly different from the rating of different response types under both
0:21:16	risk conditions
0:21:18	so anything goes basically
0:21:22	influence of risk just quickly
0:21:25	people were more conservative under high risk which is as expected fewer dos
0:21:33	effect of risk on specific response types
0:21:36	so do and choose received lower ratings under high risk
0:21:40	and confirm and rephrase were unaffected by risk
0:21:46	regardless of risk
0:21:47	people rated confirm higher than do and choose with pictures higher than choose
0:21:53	text only
0:21:56	so
0:21:57	to conclude
0:22:00	people's preferences are
0:22:02	fluid over time
0:22:04	various reasonable responses may be acceptable and as we saw a classifier that was trained on
0:22:11	a small non-target
0:22:13	corpus produced fine responses
0:22:17	risk influences people's attitudes toward some response types
0:22:22	and what does that mean
0:22:24	well this has implications for training and evaluating dialog systems but this was in a
0:22:30	restricted setting one-shot dialogues with
0:22:33	a pretend robot
0:22:34	so more studies are required
0:22:37	i
0:22:44	we have some time for questions
0:22:52	thanks it's a very interesting experiment
0:22:56	and i think it does show clearly that there's some variation in response preference which
0:23:03	we see in other experiments too i'm not i'm not sure how you come to the
0:23:07	conclusion that the users are fluid through time
0:23:12	given that you're actually asking them to do something different like rating their response
0:23:16	rating response as opposed to choosing responses is a different task
0:23:20	and if you assume that
0:23:22	users don't have just a fixed choice in mind but some kind of a probability
0:23:25	distribution or utility distribution and you're forcing a choice so they pick one and if
0:23:30	you sampled again from the same distribution you'd expect a certain amount of variation so
0:23:36	is it really that users are changing over time or that you're rolling the
0:23:40	dice and you get a
0:23:41	a different number sometimes the second time
0:23:44	yes this is a limitation we spotted that one
0:23:49	well all we can assume is
0:23:52	yes whatever the actual
0:23:54	they must have had a reason for choosing it then
0:23:57	they thought they were making perfect sense
0:24:00	and then they were given the exact same options and then in
0:24:05	in rate of pay
0:24:07	they were okay with other options that's i mean
0:24:10	or what i mean
0:24:12	to me that e d case louis
0:24:14	should we have done the experiment differently in retrospect
0:24:18	yes probably but
0:24:20	to the intention the original intention of the experiment
0:24:24	was not to do this longitudinal study we kind of stumbled upon
0:24:28	the longitudinal part
0:24:31	but okay to us this indicates that
0:24:36	you know things are not as
0:24:38	cut and dried as
0:24:40	a lot of people believe that
0:24:42	they are and anything reasonable goes
0:24:47	we have time for another question
0:24:57	can you go back to slide twenty four i think
0:25:01	wow
0:25:02	i had to fix the number in my head otherwise
0:25:06	i couldn't mm
0:25:09	there was the conclusion not so much a graph
0:25:19	oops
0:25:19	the next one
0:25:24	it doesn't one
0:25:30	sorry i had a hard time
0:25:32	following the reasoning here didn't you just show us that it is only
0:25:37	it was different no i showed there were differences
0:25:41	yes overall when you do pairwise comparison along with statistical significance
0:25:48	testing was no
0:25:51	so although it appears sometimes this wins sometimes that wins
0:25:57	when you do
0:25:59	there might bear it's not statistically significant at all
0:26:05	we did wilcoxon signed-rank
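
A minimal pure-python sketch of the statistic behind a wilcoxon signed-rank test on paired ratings. It computes only the signed rank sums, not a p-value, and the example ratings are made up; it is not the authors' analysis code. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
def wilcoxon_rank_sums(first, second):
    """Return (w_plus, w_minus) for paired samples.

    Zero differences are dropped, absolute differences are ranked with
    ties sharing their average rank, and the ranks are summed
    separately for positive and negative differences.
    """
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Positions i..j-1 are tied; they share the mean of ranks i+1..j.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical paired ratings: stage one response type vs a different one.
own = [5, 3, 4, 2, 4]
other = [3, 3, 5, 1, 1]
print(wilcoxon_rank_sums(own, other))
```

The smaller of the two sums is the usual test statistic; when the two sums are close, the paired ratings give little evidence of a systematic preference either way.
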
0:26:08	yes
0:26:10	who
0:26:11	alright let's thank the speaker again

Influence of Time and Risk on Response Acceptability in a Simple Spoken Dialogue System

Oral Session 6: Evaluation and Data

Andisheh Partovi and Ingrid Zukerman