Okay, so hello. I'm from Ulm University, as already introduced. Thank you for having me.

I'm going to talk about changing the level of directness in dialogue. First, let's have a little motivation of why this could be useful.

If we look at human dialogue, for example, one person could say "Do you want to eat salad or pizza?", and for some reason the other person decides not to answer that question directly and says "I'd prefer a warm meal." In human dialogue, we can easily infer: okay, that's the pizza then.

Then the first person could choose to be more polite and not say directly "You should really go on a diet", but just say "Pizza has a lot of calories." The other person is not offended and can say "Okay, I'll take the salad then."

If we have a look at the same conversation with a dialogue system that is not equipped to handle indirectness, we can run into a number of problems.

For example, if the system asks "Do you want to eat salad or pizza?" and the human says "I'd rather have a warm meal", a system that is not equipped to handle this indirectness and just expects a direct answer won't understand that. It then has to repeat the question, and the user has to state their answer more directly. That is not too bad, but it could be handled better by a system that understands the indirect version of the answer.

Another problem is on the output side, because sometimes, as humans, we expect our conversation partner to not be too direct. If the system chooses to be direct and says "You should not eat pizza", the human may be offended or even angry. So it would be better if the system could handle indirectness well both on the input and on the output side.

and that is why

the goal of my work is changing the level of directness of an utterance

Now I want to have a look at the algorithm, how I want to do that. At first I will give an overview of the overall algorithm, and then I will address some challenges specifically.

My algorithm works with three different types of input: the current utterance, the previous utterance in the dialogue, and a pool of utterances that it can choose from to exchange the current utterance.

The next step then is to evaluate the directness level of those utterances. From that we get the directness of the current utterance and the directness of every utterance in the pool. We need the previous utterance because the directness of course depends on what was said before: the same utterance can have different levels of indirectness depending on the previous utterance.

The next step is to filter all the utterances, so that we only keep those utterances in the pool that have the opposite directness of the current utterance. In the last step, we have to see which of those utterances is the most similar, in a functional manner, to the current utterance, which then leaves us with the utterance we can exchange it for.
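The steps just described can be sketched as a small function. This is a hypothetical illustration, not my actual implementation: `directness` and `similarity` stand in for the recurrent classifier and the dialogue vector model presented later.

```python
# Hypothetical sketch of the exchange algorithm described above.
# `directness(utterance, previous)` and `similarity(a, b)` are placeholders
# for the trained models; here they can be any callables.

def exchange_utterance(current, previous, pool, directness, similarity):
    """Replace `current` with the functionally most similar utterance
    from `pool` that has the opposite directness level."""
    # Step 1: estimate the directness of the current utterance in context.
    current_level = directness(current, previous)
    # Step 2: keep only candidates with a different directness level.
    candidates = [u for u in pool if directness(u, previous) != current_level]
    if not candidates:
        return None  # no valid exchange found in the pool
    # Step 3: pick the candidate most functionally similar to `current`.
    return max(candidates, key=lambda u: similarity(u, current))
```

With toy stand-ins for the two models, the function picks the only candidate whose directness differs from the input utterance.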

There are two challenges in this algorithm. One is the directness level: how can we estimate that? The other one is: how do we assess which utterances are functionally similar?

Let's start with the second one: what is functionally similar? I define that as the degree to which two utterances can be used interchangeably in the dialogue, so that they fulfill the same function in the dialogue.

As a measure of functional similarity, I decided to use dialogue act models. They are inspired by word vector models, so they follow the same principle: utterances are mapped into a vector space in such a manner that utterances appearing in the same context are mapped into close vicinity of each other. So if two utterances are used in the same context, it is very likely that they can be exchanged for one another. The distance in this vector space is then used as an approximation of the functional similarity.
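As a concrete illustration of using vector space distance as a similarity proxy: cosine similarity is a common choice for such embedding spaces, though the exact metric and the embeddings below are assumptions, not taken from the talk.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two utterance embeddings; 1.0 means the
    vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, vectors):
    """Return the utterance whose embedding is closest to `target`.
    `vectors` maps utterance text to its (hypothetical) embedding."""
    return max(vectors, key=lambda utt: cosine_similarity(vectors[utt], target))
```

Utterances used in the same contexts end up with similar vectors, so the nearest neighbour in this space is the best candidate for an exchange.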

I'm pretty sure that this works, because I have already published a paper about it earlier this year, and I will quickly summarise the findings of that paper so you can see why this is a good fit.

I evaluated the accuracy of clusters formed by k-means in the dialogue vector space and compared them to the ground truth of clusters formed by hand-annotated dialogue acts. So I wanted to see whether proximity in the dialogue vector space corresponds to the annotated dialogue acts. I did a cross-corpus evaluation, so the dialogue act models were trained on a different corpus than the one the clustering was performed on, and as you can see on the left side, the accuracy is very good.
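One common way to score clusters against hand-annotated labels is cluster purity: count each utterance as correct if its cluster's majority dialogue act matches its own annotation. The exact metric used in the paper may differ; this is a sketch of the general idea.

```python
from collections import Counter

def cluster_purity(clusters, gold_labels):
    """Fraction of utterances whose cluster's majority dialogue act
    matches their annotated dialogue act (a standard cluster-accuracy
    proxy). `clusters` is a list of lists of utterance ids; `gold_labels`
    maps each utterance id to its annotated dialogue act."""
    correct = 0
    total = 0
    for members in clusters:
        labels = [gold_labels[u] for u in members]
        # The majority label's count is the number of "correct" members.
        correct += Counter(labels).most_common(1)[0][1]
        total += len(labels)
    return correct / total
```

If the k-means clusters in the dialogue vector space line up with the annotated dialogue acts, this score approaches 1.0.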

That is why I think dialogue act models work very well for the estimation of functionally similar utterances.

So let's get to the estimation of directness, which was the second challenge. You can already see the architecture here: this is a recurrent neural network that estimates the directness with a supervised learning approach. As input, it uses the sum of word vectors on the one hand, so for every word in the utterance we take its word vector and just add them all up. I also use the dialogue vector representation as an input. And since it is a recurrent network, we have a recurrent connection, so the network also receives the input of the previous utterance.

The output I have framed as a classification problem, so the output is the probability of the utterance being either very direct, for example "I want tea to drink"; slightly indirect, for example "Can I get a tea to drink?", which is not quite the same but still has all the main words in there that are necessary for the meaning; or very indirect, where you just say "I don't like meat" and hopefully the other person can infer what you mean.
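A toy forward pass of such a recurrent classifier might look as follows. This is an Elman-style sketch with random weights, purely illustrative: the talk does not specify layer sizes or cell type, and a real model would be trained on the annotated directness levels.

```python
import math, random

random.seed(0)

def softmax(z):
    """Turn raw scores into a probability distribution."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

class DirectnessRNN:
    """Toy Elman-style recurrent classifier. The hidden state carries
    information across utterances, mirroring the recurrent connection
    to the previous utterance described in the talk."""

    def __init__(self, input_dim, hidden_dim, n_classes=3):
        rnd = lambda n, m: [[random.uniform(-0.1, 0.1) for _ in range(m)]
                            for _ in range(n)]
        self.w_in = rnd(hidden_dim, input_dim)    # input weights
        self.w_rec = rnd(hidden_dim, hidden_dim)  # recurrent weights
        self.w_out = rnd(n_classes, hidden_dim)   # output weights
        self.h = [0.0] * hidden_dim               # hidden state

    def step(self, x):
        """x: summed word vectors plus dialogue vector for one utterance.
        Returns class probabilities (direct / slightly indirect / indirect)."""
        self.h = [math.tanh(sum(wi * xi for wi, xi in zip(row_in, x)) +
                            sum(wr * hr for wr, hr in zip(row_rec, self.h)))
                  for row_in, row_rec in zip(self.w_in, self.w_rec)]
        return softmax([sum(w * hr for w, hr in zip(row, self.h))
                        for row in self.w_out])
```

Calling `step` once per utterance lets the classifier condition each prediction on what was said before, which is exactly why the recurrent connection is there.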

This has not been tested before, so as part of the evaluation for this work, I also evaluated how well the estimation of directness with this approach works. And with that, let's get to the evaluation.

As I said, on the one hand the accuracy of the directness estimation was evaluated, and on the other hand, of course, the accuracy of the actual utterance exchange. For that we need ground truth. That means we need a dialogue corpus that contains utterances that we can exchange, we need an annotation of the directness level, and an annotation of dialogue acts, in order to see whether we made a correct exchange.

It was impossible to find a corpus like that, and I also wasn't sure we could collect a corpus like that ourselves, because it is very difficult to not inhibit the naturalness of the conversation while still getting the participants to convey the same meaning in different phrasings at different directness levels, which we would need to make sure that there are exchangeable, equivalent utterances in the corpus. So I decided to use an automatically generated corpus, which I want to present now.

The corpus generation started from a definition of the dialogue domain with system and user actions, and with succession rules that set which action could follow which other action. Each action had multiple utterances with which that action could be phrased, each of course with a directness level depending on the previous utterance. We then started at the beginning with the start action and simply recursed through all its successors, and their successors again, until we reached the end, and thereby generated all the dialogue flows that were possible within the domain we defined. The wording was then chosen randomly, and this resulted in more than four hundred thousand dialogue flows.
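The recursive enumeration of dialogue flows can be sketched like this. The domain below is a made-up toy, assuming succession rules stored as a mapping from each action to its allowed successors; the actual generator additionally sampled a random wording per action.

```python
def all_flows(action, rules, path=()):
    """Enumerate every dialogue flow: starting from `action`, recursively
    follow every allowed successor until an action has no successors."""
    path = path + (action,)
    successors = rules.get(action, [])
    if not successors:
        return [path]          # reached the end: one complete flow
    flows = []
    for nxt in successors:
        flows.extend(all_flows(nxt, rules, path))
    return flows

# Hypothetical toy domain with succession rules.
RULES = {
    "greet": ["ask_order"],
    "ask_order": ["order_pizza", "order_salad"],
    "order_pizza": ["confirm"],
    "order_salad": ["confirm"],
    "confirm": [],             # end action
}
```

Every branch in the succession rules multiplies the number of flows, which is how a modest domain definition can yield hundreds of thousands of dialogues.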

There were about four wordings for every dialogue act. For example, as you can see here, "yes" could be worded as "That is great, I'm looking forward to it" or "That sounds delicious", depending on what the previous utterance was; or "I would like to order pizza" as "Can I order pizza from you?". The topics of those conversations were, for example, ordering a pizza or arranging for cooking together.

I tried to incorporate many elements of human conversation. For example, I had over-answering, misunderstandings, requests for confirmation, corrections, and things like that, and, as already mentioned, context-dependent directness levels.

For example, "Do you have time today?" can be answered with "I haven't planned anything", which is not a direct answer, so it has directness level three. But after a different question, such as "What have you planned for today?", the same answer "I haven't planned anything" is a direct answer and receives directness level one.
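Because the annotation is context-dependent, the generated corpus effectively stores directness per (previous utterance, utterance) pair. A tiny illustration, with hypothetical entries based on the example above:

```python
# Hypothetical annotation table: the directness level of an utterance
# depends on the question preceding it, as in the example above.
DIRECTNESS = {
    ("Do you have time today?", "I haven't planned anything"): 3,
    ("What have you planned for today?", "I haven't planned anything"): 1,
}

def directness_level(previous, utterance):
    """Look up the annotated directness of `utterance` in its context."""
    return DIRECTNESS[(previous, utterance)]
```

The same surface string gets two different labels, which is why the classifier needs the previous utterance as input.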

Of course, with an automatically generated corpus there are some limitations. We don't have the variation of natural conversations, both with regard to the dialogue flow and to the wording. That very likely means the corpus is more predictable and therefore easier for our algorithm.

However, I also see some advantages to this approach. On the one hand, we have a very controlled environment. We can make sure that, for example, there is an actual answer among the corpus utterances, so we know that there is a valid exchange, and if we don't find it, the fault lies with our algorithm, and not with there being no correct utterance in the corpus.

Also, because the corpus was not annotated by hand but generated together with the ground truth, this ground truth is very dependable. And I think it is an advantage that, using this approach, we have a very complete dataset: we have all the possible flows, and we have many different wordings.

I think that having this for a small application can have implications for what happens if we actually have a lot of data and approach full coverage. Usually, if I just collect dialogues, I won't have a lot of data and I will have poor coverage. Larger companies may have such data, but I just don't get it. So this small but complete set that we generated can give some indication of what would happen if I could get that data at all.

For our results, this means of course that they do not represent the actual performance in an applied spoken dialogue system, which faces natural conversations; it is very likely that the approach will perform worse there. But we can see the potential of our approach given ideal circumstances, so I think the results still have some merit.

With that, let's get to the actual results, at first the accuracy of the directness estimation. Here we used as input the dialogue vector model, which was trained on our automatically generated corpus, and word vector models that were trained on the Google News corpus; you can see the reference for that. As the dependent variable, we of course have the accuracy of correctly predicting the level of directness as annotated.

As independent variables, we used versions with and without word vectors as input, to see whether they improve the results at all. We also wanted to see whether the size of the training set impacts the classifier. We used ten-fold cross-validation as usual, which leads to a training corpus of ninety percent of the data, and we also tested what happens when we only use ten percent of the data. We also used different dialogue vector models: there we varied the size of the generated dialogue corpus, that is, how many of the dialogues we included in the actual training.

Here you can see the results. We could achieve a very high accuracy of directness estimation, but keep in mind that it is an automatically generated corpus, so that plays a role, of course. The baseline of majority class prediction would have been 0.5291, and we can clearly outperform that. We can see a significant influence of both the size of the training set and of whether or not we include the word vectors. I think that the word vectors improving the estimation results so much as input really speaks to the quality of those models, which are trained on huge amounts of data; and extensive word vector models are commonly available, so that should not be a problem.

What could be a problem is the size of the training set, because this is annotated data; it is a supervised approach. So if we want to scale this approach, we would need a lot of annotated data. Perhaps in the future we could consider an unsupervised approach for this that doesn't need a lot of annotated data.

Now to the accuracy of the utterance exchange. For the functional similarity, we again used the dialogue act models trained on the automatically generated corpus, and for the directness estimation we used different versions of the trained classifier that I just presented. The dependent variable is the percentage of correctly exchanged utterances, and the independent variables here were the classifier accuracy and, again, the size of the training corpus for the dialogue act models. Here you can see the results.

The best performance we could achieve overall was 0.7, so seventy percent of utterances were correctly exchanged. We have a significant influence of both the classifier accuracy and the size of the training data for the dialogue act models.

A common error made by the algorithm was that the utterance exchange was done with either more or less information than the original utterance. For example, "I want something spicy" was exchanged with "I want a large pepperoni pizza", and "large", of course, is not included in the first sentence. This points to the dialogue act models, as we trained them, not being able to differentiate that well between those variants. But this could be solved by just adding more context to them, so during the training we would take into account more utterances in the vicinity.
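Taking more utterances in the vicinity into account amounts to widening the context window when building training pairs, word2vec-style. The pairing scheme below is an assumption about how such a model could be trained, not a description of the actual training code.

```python
def training_pairs(dialogue, window=2):
    """Build (context utterance, target utterance) pairs for training a
    dialogue vector model. A larger `window` gives the model more
    surrounding context, which may help it separate utterances that
    differ only in detail."""
    pairs = []
    for i, target in enumerate(dialogue):
        lo = max(0, i - window)
        hi = min(len(dialogue), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target is not its own context
                pairs.append((dialogue[j], target))
    return pairs
```

Doubling the window roughly doubles the context each utterance is trained against, at the cost of noisier, more distant pairs.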

We can see here the importance of a good classifier and a good similarity measure. The similarity measure I don't consider a problem, because it works on unannotated data, so we can just take large corpora of dialogue data and use those. Again, the annotated data is the real challenge here, and we should consider an unsupervised approach.

Now a short discussion of the results. I think the approach shows high potential, but the evaluation was done in a theoretical setting, and we have not applied it to a full dialogue system; therefore there are still some questions to be answered. This corpus doesn't have the variability of natural dialogue, and that means that very likely the performance of the classifier and the dialogue vector models will decrease in an actual dialogue. To compensate for that, we would need more data.

We also have the problem that we don't really know whether, in an actual dialogue corpus, a suitable alternative to exchange for actually exists. Again, with an increasing amount of data it becomes more likely, but it is not certain. So perhaps as future work we can look into the generation of utterances instead of just their exchange.

Another point is the relation between user experience and the accuracy of the exchange, because at the moment we don't know what accuracy we actually need to achieve in order to improve the user experience. That is also something we should look into.

With that we are at the end of my talk, and I want to conclude what I presented to you today. I discussed the impact of indirectness in human-computer interaction and proposed an approach to changing the level of directness of an utterance. The directness estimation is done using recurrent neural networks; the functional similarity measure uses dialogue act models. The evaluation shows the high potential of this approach, but there is also a lot of future work that should follow.

It would be good to have a corpus of natural dialogues annotated with the directness level, to use that for evaluation. There would be benefits to an unsupervised estimation of the directness level. Also, an evaluation on an actual dialogue corpus would give more insight into how that impacts the performance.

The generation of suitable utterances would also be desirable, because we don't actually know whether the right utterances are in the corpus. And finally, of course, we would like to apply this to an actual, full dialogue system. Thank you very much for your attention.

No, I did not evaluate that yet.

Yes, a lot of my motivation in this regard comes from cultural differences: directness is a very major difference that exists between cultures, and therefore a source of major interest for me.

Yes.

I think it would be really good to have such a corpus, and I'm thinking about ways to get one. I think one of the main difficulties there is, as I said, that I'm coming at this from cultural differences. For example, I would expect a German to be even more direct than, for example, a Japanese person. Then we have the translation problem: we can't exchange German utterances for Japanese utterances, so that makes it difficult. And I'm not sure how to ensure, for example, that German participants would actually use an indirect version of the direct utterances as well. So there is a little bit of a problem there.

That sounds interesting, thank you very much.

So this was part of the earlier work; there I just used the k-means clustering algorithm to find clusters. In this work, I don't actually define clusters, but just use the closest utterance.

No, I used a quite basic notion of directness: if it is a colloquial reformulation, like adding "you know", but all the main words from the original sentence are still in the exchanged sentence, then it is rated as very direct.