so our next speaker will be ingrid zukerman

and she'll be talking about the influence of time and risk on response

acceptability in a simple spoken dialogue system

okay so this is work with andi

and the you know why am

and now it


that doesn't want to


that works

okay so

what are we doing here

evaluations of dialogue systems are often based on ratings


if you look at research in recommender systems you will see that people's ratings are

inconsistent over time and that leads to what is called the magic barrier you can

only get to a certain point in accuracy due to people's inconsistencies


we ask ourselves

is this true for dialogue systems

and of course this has implications for the reliability of the evaluations of systems and

about comparative evaluations between systems


while we were at it we also wanted to check the effect of situational

risk on how people view the responses of

a dialogue system


we did an experiment we conducted a longitudinal study that is over time

in the context of a spoken dialogue system for a household robot


the corpus that we used

was a corpus

of spoken requests

that task a robot to fetch or move objects in a room

and this study was in two stages

one of the reviewers of the paper called this heroic thank you

and in the first stage

people selected how they would respond to requests

and after

a year and a half to two years

we gave people their own responses and other responses and asked them to rate those


so the questions that we want to answer

one how well do the participants like their stage one response types and we call them response

types rather than dialogue acts

because one of the response types could be just do what you are asked

that's not a dialogue act

two do the users prefer their stage one response types to other response types

and three again the situational risk because well

it was something we were interested in

so the first thing let's describe the corpus

the corpus was created in the past when we were developing our system we

had thirty five participants that described twelve objects

in different images we had a total of

four hundred and seventy eight descriptions because people were allowed repetitions

asr performance this is google now

a bit worse than what

you would think

so word error rate thirteen percent the top ranked interpretation was wrong in about

half the cases and all interpretations were wrong in about a third of the cases

some of the wrong things were little things like a or and

and that was thirteen percent of the cases

we retained

two hundred and ninety two descriptions why so few

for some of them there was inconsistency in rating like some people rated only stage one

others rated only stage two so we couldn't keep them others our system

couldn't process

and others had more than one prepositional phrase and

we can process those but i will

explain later why we got rid of them

so each of those two hundred and ninety two descriptions had

the top four asr outputs

and okay let's go back a second

why the top four we want the participant

to hear quote unquote what this

spoken language understanding system is hearing which is the output of the asr

and then we took our descriptions as i said that were generated in the context

of another study


prepended get or move to each asr output to turn them into requests

then this corpus was divided into sets of at most twelve requests

one per object

so those of you who have heard me before will have seen these pictures

so participants were asked to designate one of the objects a b or c

eventually all three of them but one at a time

so in this case the participant is describing the hard disk under the table

this is what the asr heard

none of them is correct this is true asr output

and then we put the get in front

so get

that thing

in the second image again the participant wants the ball farther

away from the plate

which object they have

this is what the asr heard

and again we add

the get

and this time one of the interpretations

is correct yes the first one

this is our third image

the plate in the middle of the table

so we play the same game i can speed up now

and our final image the cleanable crack yes that's what they said



this is what the asr heard

and this time we

do move why because it's a big object

we cannot ask anybody to get the bookcase
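
A minimal sketch of the prepending step described above. The set of large objects and the function name `to_requests` are hypothetical (the talk names only the bookcase as an object that gets move rather than get), and the ASR strings are illustrative, not corpus data.

```python
# Hypothetical set of objects too big to fetch; the talk only names the bookcase.
LARGE_OBJECTS = {"bookcase"}


def to_requests(asr_outputs, target_object):
    """Turn ASR alternatives into requests by prepending a verb.

    "move" is used when the designated object is large, "get" otherwise.
    """
    verb = "move" if target_object in LARGE_OBJECTS else "get"
    return [f"{verb} {text}" for text in asr_outputs]


# Illustrative ASR alternatives for one description.
for request in to_requests(["the plate in the middle of the table"], "plate"):
    print(request)
```

The verb depends on the designated object rather than on the recognized text, since the ASR alternatives may all be wrong.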



this is how we collected our corpus and now we start with

the trial stage one

we collected demographic information gender english nativeness that is whether they are native english speakers

age education

and we also collected risk propensity information because we are interested in the

effect of risk

so we collected these from work firearm and that s

six

risk proneness

statements such as i follow the motto nothing ventured nothing gained

and six

risk aversion statements such as my decisions are always made carefully and accurately and

there are six of each and we measured the agreement on a one to five

likert scale


these are our demographic characteristics

in stage one we had forty participants six of those were not reachable in

stage two so we have thirty four people

seventeen female seventeen male eighteen native english speakers sixteen non-native

and these are the age and education

profiles

and for risk proneness just to give you an idea about the human condition

we subtract the

risk aversion from risk proneness so the sum of

all their scores

and this is what our population looks like they seem to be more

risk prone than

risk averse
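
A minimal sketch of the risk score just described, under one reading of the talk: the six risk-aversion ratings are summed and subtracted from the sum of the six risk-proneness ratings, each on a one to five scale. The function name and the validation are assumptions, not the authors' code.

```python
def risk_propensity_score(prone_ratings, averse_ratings):
    """Sum of risk-proneness ratings minus sum of risk-aversion ratings.

    Assumes six statements of each kind, each rated on a 1-5 Likert scale,
    so the score ranges from -24 (very risk averse) to 24 (very risk prone).
    """
    for ratings in (prone_ratings, averse_ratings):
        if len(ratings) != 6:
            raise ValueError("expected six ratings per statement type")
        if any(r not in (1, 2, 3, 4, 5) for r in ratings):
            raise ValueError("ratings must be integers from 1 to 5")
    return sum(prone_ratings) - sum(averse_ratings)


# Illustrative ratings for one participant, not data from the study.
print(risk_propensity_score([4, 3, 5, 4, 2, 3], [2, 3, 1, 2, 4, 3]))
```

A positive score means the participant agreed more with the risk-proneness statements than with the risk-aversion statements.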

so now

now we get to the real stage one

so as i said each participant was shown

the top four asr outputs for each request maximum twelve requests one

per item in each image

and they were shown versions of the images where all the objects are numbered

why because they could

pick any object

to respond

we had two risk conditions low and high we told them that in the

low risk condition the responder is in the same room as the requester

in the high risk condition the responder is far away and it will

incur a lot of inconvenience if they do the wrong thing


they had four response types to

choose from and

they got explanations of what

each response means

in fact they only got this side

this side is for us


do means you would just fetch object number

and put the number of the object you would fetch

confirm is

you want to ask did you mean object and again object number

choose which object did you mean

given a list of objects and rephrase is i can't hear you

i want you to restate so they had four response types to choose from

so this is a sample item so now we see the same room we saw

but all the objects are numbered

and this is what the survey looks like so

here you have the four outputs assuming that you are in the same room

as the speaker

select one of the responses

get object number did you mean object which object did you mean and for rephrase

we actually gave them the option

to say rephrase the object rephrase the position or rephrase the whole sentence

now we distinguish because the asr makes most of the errors on the object not

on the location

and then

we said assume that

the speaker is in a remote location would you change your answer

and we asked the same question


after stage one we got three corpora


so we had

five hundred and eighty four responses so

two hundred and ninety two requests times two risk conditions


it will become clear why we have three corpora so the first one is

response corpus

response corpus is what answers we got from our participants

and this is the distribution of the answers under the low and the high

risk conditions

so do is clearly the majority class

and we have confirm choose rephrase and as you can see


there are

fewer dos and more confirms and chooses

and rephrases under the high risk condition

in addition we developed

two corpora

oracle corpus and classifier corpus so what is oracle corpus

andi

responded to everything

why did we want the oracle corpus because there is a lot of

variability between people and we wanted to see how user variability affects

the results

and the third and final corpus is called classifier corpus

and what we did is we trained a classifier

to select responses

based both on oracle corpus and on response corpus

oh and i promised i would tell you this sorry

so this is why we threw out the

requests with more than one prepositional phrase because we wanted to restrict the features that

we used for training the classifier because we just wanted a simple classifier

okay so

what does our response classifier look like it assumes that

we have a spoken language understanding system that returns ranked interpretations

we have three types of classification features the asr confidence in the correctness of

its own outputs

how well an interpretation matches the description

the risk of the situation and for response corpus we also have the demographic

and risk propensity information


a quick example this is a close up of one of the rooms

the description is the brown stool near the table


these two stools match the description well

the one over there is a bit closer but they both

are a pretty good match

what about the classifiers so how did the classifier do

we tested a whole bunch of classifiers and random forest won


the main thing to note is

the bottom line of course we are doing better on andi's corpus than on

the corpus of other people

why because there was a lot of variability in responses under the exact same conditions


but this is just

before you think i'm wasting your time

and this is not important for the purposes of this paper

so now

we proceed to experiment two

a year and a half to two years later


each participant is shown

the same asr outputs and the same images as in stage one

two risk conditions again


a bunch of candidate responses

sourced from

the response type in response corpus so their own responses

andi's responses

the response picked by the classifier

and also

do confirm pairs so whenever one of these responses was a do if there was

no confirm in the above three we added the confirm


if one of these was a confirm and there was no do

we added the do

of course we didn't repeat

if several of these chose the same response we presented it only once
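
A minimal sketch of how the candidate list for one request could be assembled from the rules just stated. Response types are reduced to plain strings here, which is a simplification: the real responses also carry object numbers, and the function name is hypothetical.

```python
def candidate_responses(own, oracle, classifier):
    """Build the de-duplicated candidate list for one request.

    The three arguments are the response types chosen by the participant in
    stage one, by the second annotated corpus, and by the classifier.
    """
    candidates = []
    for response in (own, oracle, classifier):
        if response not in candidates:  # present each response only once
            candidates.append(response)
    # Complete the do/confirm pair when only one member is present.
    if "do" in candidates and "confirm" not in candidates:
        candidates.append("confirm")
    elif "confirm" in candidates and "do" not in candidates:
        candidates.append("do")
    return candidates


print(candidate_responses("do", "do", "choose"))
```

With this reading a request yields between one and four candidates, depending on how much the three sources agree.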

now we had some

small challenges do and rephrase are direct renditions of the selections in stage one

but for confirm and choose

we needed to do some instantiation

so for confirm we chose a pictorial query so we

would say is this what you want

and this is your confirmation the particular plate

for choose we had two options there are two plates on the table

and then

we presented

what was

which one do you want or do you want this or that


the pictorial version was restricted to only two or three options

if there were more options in the list

i mean nobody says this or this or this or that

it's usually t c


and this is what the survey looks like again we have the same image

we have the output


now they get to choose between all these responses

and they get to rate them on

a likert scale


okay going back to our questions so how did we do

participants' ratings of their stage one responses are significantly lower

than the ratings ascribed to these response types under both risk conditions what

do i mean by ascribed

if you recall in stage one

they had to pick a response how would you respond

so we said okay

in order to account for rater bias

we will say okay the one they picked is their best opinion

so if their highest opinion of anything was a five

we ascribed to the response a five if it was a four we ascribed a four


but the rating was significantly lower
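
A minimal sketch of the ascribed rating, under the assumption that the maximum is taken over all ratings a participant gave in stage two. The talk does not say whether the maximum is per participant or per request, and the function names are hypothetical.

```python
def ascribed_rating(all_ratings):
    """Rating ascribed to a participant's stage one choice.

    Assumed to be the highest rating the participant gave to anything,
    which normalizes for raters who never use the top of the scale.
    """
    return max(all_ratings)


def rating_drop(all_ratings, stage_two_rating_of_own_choice):
    """Ascribed rating minus the stage two rating of the stage one choice."""
    return ascribed_rating(all_ratings) - stage_two_rating_of_own_choice


# Illustrative ratings: the participant's highest rating overall is 5,
# but in stage two they rated their own stage one choice 1.
print(rating_drop([1, 5, 3, 4], 1))
```

A drop of zero means the participant still rated their stage one choice as highly as anything else; a positive drop means they rated it lower.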


this histogram presents the difference in the rating between

their ascribed ratings and their stage two ratings

so for a lot of them

they kept it

so whatever we ascribed they also thought it was pretty good


for quite a lot of them like to

hundred and thirty three for low risk and hundred and sixty nine for high risk

they significantly reduced the rating

question two

do participants prefer their stage one response types to other response types

in the paper we have both andi and the classifier

here i'm only showing the classifier why the classifier the version of the classifier that

we are using is the one trained on andi's corpus it was not even trained on the participants


so what did we do

we took all the responses that differed

between stage two and stage one and then checked

the rating

only of the different responses

so in a lot of cases

stage one was better than the classifier

in quite a few cases they were the same and

in enough cases

the classifier that is trained on somebody else did better than their own


so this is an example

what to get

and in stage one

the user

picked choose

but then in stage two they gave choose a rating of one and confirm

a rating of five

but having said that

at the end of the day

participants' ratings of their stage one response types

are not statistically significantly different from the ratings of different response types under both

risk conditions

so anything goes basically

influence of risk just quickly

people were more conservative under high risk which is as expected fewer dos

effect of risk on specific response types

so do and choose received lower ratings under high risk

and confirm and rephrase were unaffected by risk

regardless of risk

people rated confirm higher than do and choose with pictures higher than choose

text only


to conclude

people's preferences are

fluid over time

various reasonable responses may be acceptable and as we saw a classifier that was trained on

a small non-target

corpus produced fine responses

risk influences people's attitudes towards some response types

and what does that mean

well this has implications for training and evaluating dialogue systems but this was in a

restricted setting one-turn dialogues with

a pretend robot

so more studies are required


we have some time for questions

thanks it's a very interesting experiment

and i think it does show clearly that there's some variation in responses permitted which

we see in other experiments too i'm not sure how you come to the

conclusion that the users are fluid through time

given that you're actually asking them to do something different rating their responses

rating responses as opposed to choosing responses is a different task

and if you assume that

users don't have just a fixed choice in mind but some kind of a probability

distribution or utility distribution and you're forcing a choice so they pick one and if

you sampled again from the same distribution you'd expect a certain amount of variation so

is it really that users are changing over time or that you're rolling the

dice and you get a

different number sometimes the second time

yes this is a limitation we thought about that one

well all we can assume is

yes whatever they chose

they must have had a reason for choosing it then

they thought they were making perfect sense

and then they were given the exact same options and then

in rating

they were okay with other options that's i mean

all what i mean

to me that indicates fluidity

should we have done the experiment differently in retrospect

yes probably but

the original intention of the experiment

was not to do this longitudinal study we kind of stumbled upon

the longitudinal part

but okay to us this indicates that

you know things are not as

cut and dried as

a lot of people believe

and anything reasonable goes

we have time for another question

can you go back to slide twenty four i think


i had to fix the number in my head otherwise

i couldn't mm

it was the conclusion not so much a graph


the next one

it doesn't one

sorry i had a hard time

following the reasoning here didn't you just show us that

it was different no i showed there were differences

yes however when you do pairwise comparison along with statistical significance

testing there was none

so although it appears sometimes this wins sometimes that wins

when you do

the test it's not statistically significant at all

we did wilcoxon signed-rank



alright let's thank the speaker again