so i'm from carnegie mellon, and this is collaborative work

with my advisors alan black and alex rudnicky over there

today i'm gonna talk about strategy and policy learning for non-task-oriented conversational systems

so as we all know, non-task-oriented conversational systems are what people call chatbots or social chatbots

so the task is what we call social chatting, and people always ask me

why we need social chatting

so the motivation is actually simple

if we look at human conversations, we actually use a lot of social chatting

when you're meeting someone for a certain task, you actually do some social chatting to ease into the conversation, for example talking about your weekend before you get into the meeting agenda

so social chatting is a certain type of conversation mostly used to build social ties with your coworkers or your friends, and of course it has other application fields, like education,

where you want the agent to be socially intelligent, to be able to use this kind of casual chatting to interleave with the conversation,

and also health care

and language learning; in the complex tasks deployed in these areas, social chatting is essential

so we want to design a system that is able to perform social chatting

and we have some goals in mind: one is we want the system

to be appropriate,

we want the system to be able to go into depth in the conversation,

and we want the system to provide a variety of answers suited to the

users

so the main goal is to make sure the system is coherent and appropriate at the single-turn level

so we define appropriateness as how coherent the response is with the

user utterance, and we have three labels: appropriate, interpretable, and inappropriate

later we're gonna use these labels to evaluate our systems

so first of all we need a lot of data in order to evaluate the system, and

at the same time we also wanted a fairly easy pipeline to actually do

the evaluation

people who have been working on these kinds of systems know that it's hard to get

data

and user evaluation, where you have to have a user interact with the system, is

also very expensive

so here, in order to expedite the process, we developed a text

api, so people can access the chatbot in a web browser

we can have multiple people talk to it at the same time, it's multi-threaded

and we also automatically connect the user to a rating task after the

conversation, so they can rate whether a certain response is appropriate or not, and we give them

the whole dialogue history to review

so we made both the data and the code open source,

so you can get them from github
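just to make that setup concrete, here is a minimal sketch of a multi-threaded text chat endpoint; it assumes Flask, and generate_response() is only a stand-in, not the released code

```python
# minimal sketch of a multi-threaded text chat endpoint, assuming Flask;
# generate_response() is a placeholder for the actual system, not the released code
from flask import Flask, request, jsonify

app = Flask(__name__)
sessions = {}                          # session id -> dialogue history

def generate_response(history, user_text):
    return "how about we talk about music"   # stand-in reply

@app.route("/chat", methods=["POST"])
def chat():
    data = request.get_json()
    sid, user_text = data["session_id"], data["text"]
    history = sessions.setdefault(sid, [])
    reply = generate_response(history, user_text)
    history.extend([("user", user_text), ("system", reply)])
    return jsonify({"reply": reply, "history": history})

if __name__ == "__main__":
    # threaded=True lets multiple users talk to the bot at the same time
    app.run(threaded=True)
```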

we also have demos that run on amazon mechanical turk, on a

machine that runs

twenty-four hours a day, seven days a week, so i'm

just gonna demo it a little bit, so here it is

here's the screen, and you type in something; for example the chatbot says

something like 'how about we talk about music'

and you say 'sure'

'what do you want'

'what do you want to talk about'

it talks about almost everything, and the interaction is very easy, so it's a

very nice way to motivate the user to interact with the system

and it's also a very easy way to collect evaluation data, so we sometimes post it on

mechanical turk or social networks to actually get more users

let's take a step back and look at previous work on task-oriented

systems

we're usually familiar with this architecture: once we get the user input,

we do language understanding, then we go to a dialog manager that decides what to

generate, and in the end we have the system output

so a lot of work has been done on what happens if there is some non-understanding

in the system, when the user says something that is not

comprehensible for the system

a lot of people have designed conversational strategies to handle these errors, for

example saying 'can you say that again', so we are very familiar with

these kinds of conversational strategies

however,

there is also a lot of work,

including here at cmu, on

using a pomdp or mdp to optimize the process of

choosing which strategy to use, planning globally to optimize the

task completion rate

so that's the previous work on task-oriented systems; can we do the same

on non-task-oriented systems?

so the research questions are:

can we develop conversational strategies to handle,

for example, the appropriateness that we really care about, in a

non-task-oriented system?

and can we actually use this kind of globally planned policy to actually regulate the

conversation, for instance to decide when to switch the topic?

sorry,

i apologise for the

disturbance

so we are trying to ask whether we can design

conversational strategies and conversational policies

to help the non-task-oriented responses be more appropriate

and here we design an architecture which is very similar to a task-oriented

system

so first, once we get the user input, we try

to use some context tracking strategies that we developed

and then we generate a response

and if the system

has high confidence that the response is a good one,

then we just

return the system response to the user

if the system is not confident that it's a good response,

then we go into some lexical-semantic strategies that

we introduce later to deal with the low confidence; if one of them triggers, we

just use those methods to generate the output; if

none of the conditions for these strategies triggers, we go into our engagement or

appropriateness strategies to actively generate a reply

in yesterday's session we also talked about another system which

is similar to this one, where we also take engagement into

consideration in the whole dialogue

process
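to make the control flow easier to follow, here is a minimal sketch of that dispatch logic; every helper is a trivial stub standing in for the components described in the talk, and the confidence threshold is illustrative

```python
# minimal sketch of the dispatch logic; every helper is a trivial stub
# standing in for the components described in the talk
def track_context(user_input, history):
    return user_input                       # anaphora resolution etc. would go here

def generate_response(text, history):
    return "i like taylor swift", 0.4       # (candidate, confidence) stub

def lexical_semantic_strategies(text, history):
    return None                             # returns a reply only if a condition triggers

def engagement_strategy(history):
    return "do you want to watch a game together sometime"

def respond(user_input, history, high_conf=0.7):
    tracked = track_context(user_input, history)
    response, confidence = generate_response(tracked, history)
    if confidence >= high_conf:             # confident enough: answer directly
        return response
    reply = lexical_semantic_strategies(tracked, history)
    if reply is not None:                   # a lexical-semantic strategy triggered
        return reply
    return engagement_strategy(history)     # otherwise fall back to engagement

print(respond("do you like sports?", []))
```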

so, as i said,

we have three sets of strategies that we're gonna talk about later in detail, about

how we can make the system more appropriate, and

also a policy that

actually chooses between different strategies to make the whole process better,

that is,

to optimize the whole process globally

so we have two components: we're gonna talk about the response generation

side and the conversational strategy selection side

first of all, how do we track context? so

we start with anaphora resolution, where we mainly

do pronoun resolution,

because we wanted a strategy that covers ninety percent of the cases

so for example, the system asks 'do you like taylor swift',

so we detect 'taylor swift',

and

the user says 'yes i like her a lot', and we replace 'her' with 'taylor swift'

for the next response generation
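a very small sketch of that pronoun-replacement idea, assuming the most recent named entity mentioned by the system is already known; the pronoun list is illustrative

```python
# very small sketch of the pronoun-replacement idea: substitute third-person
# pronouns with the most recent named entity mentioned by the system.
# the entity handling and pronoun set here are illustrative only.
PRONOUNS = {"he", "she", "him", "her", "it", "they", "them"}

def resolve_anaphora(user_utterance, last_entity):
    tokens = user_utterance.split()
    resolved = [last_entity if t.lower() in PRONOUNS else t for t in tokens]
    return " ".join(resolved)

print(resolve_anaphora("yes i like her a lot", "taylor swift"))
# -> "yes i like taylor swift a lot"
```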

we also do response ranking with history similarity

basically we use word2vec to rank the similarity between the candidates and the

previous user utterance

for example, the chatbot says 'i watch a lot of

baseball games'

and the user says 'which one do you like most'

so here we have two candidates:

one is something like 'i like taylor swift',

the other is something like 'i like basketball'; so here, if we do

the word2vec similarity ranking, we narrow it down:

the second one is preferred, because the two utterances are closer in

semantics
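a sketch of that history-similarity ranking, with a tiny hand-made vector table standing in for real word2vec embeddings

```python
# sketch of history-similarity ranking with toy word vectors; real word2vec
# embeddings would replace the tiny hand-made table below
import numpy as np

toy_vectors = {
    "baseball":   np.array([0.9, 0.1]),
    "basketball": np.array([0.8, 0.2]),
    "taylor":     np.array([0.1, 0.9]),
    "swift":      np.array([0.2, 0.8]),
}

def sentence_vec(text):
    vecs = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

history = "i watch a lot of baseball games"
candidates = ["i like taylor swift", "i like basketball"]
best = max(candidates, key=lambda c: cosine(sentence_vec(history), sentence_vec(c)))
print(best)   # the basketball candidate is closer to the baseball history
```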

then we go into the response generation methods

so after we take the context and history into account, we do the

actual generation; we have two methods that we actually use,

and we select between them based on confidence

one is keyword retrieval matching

basically,

we find the keywords in the

user's utterance and match them against the database,

and we return the corresponding response that has the highest

aggregated weight

the database is built from existing interview transcripts,

and we also collected additional data using mturk
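a sketch of that retrieval step; the keyword weights and database rows below are made-up examples, not the collected data

```python
# sketch of keyword retrieval matching: score every stored response by the
# summed weight of the user keywords it matches, return the best one.
# the keyword weights and database rows are made-up examples.
keyword_weights = {"music": 2.0, "baseball": 1.5, "politics": 1.0}

database = [
    ("i listen to a lot of pop music", {"music"}),
    ("i watch a lot of baseball games", {"baseball"}),
    ("i don't follow politics much", {"politics"}),
]

def retrieve(user_utterance):
    user_keywords = {w for w in user_utterance.lower().split() if w in keyword_weights}
    def score(entry):
        _, response_keywords = entry
        return sum(keyword_weights[k] for k in user_keywords & response_keywords)
    best_response, _ = max(database, key=score)
    return best_response

print(retrieve("do you like music or baseball"))
```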

the other is a sequence-to-sequence neural network

model

basically we use an encoder and a decoder to generate the response,

following previous work on this kind of model
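a minimal sketch of such an encoder-decoder, assuming PyTorch; the layer sizes, vocabulary, and the way a confidence score is computed at the end are illustrative, not the actual system

```python
# minimal sketch of a GRU encoder-decoder, assuming PyTorch and a toy vocabulary;
# hyper-parameters and the confidence heuristic are illustrative only
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # encode the user utterance into a final hidden state
        _, h = self.encoder(self.embed(src_ids))
        # decode the response conditioned on that state (teacher forcing)
        dec_out, _ = self.decoder(self.embed(tgt_ids), h)
        return self.out(dec_out)            # token logits per position

model = Seq2Seq(vocab_size=5000)
logits = model(torch.randint(0, 5000, (1, 6)), torch.randint(0, 5000, (1, 7)))
# one simple generation confidence: the mean max softmax probability per token
confidence = logits.softmax(-1).max(-1).values.mean().item()
```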

so we have these two

methods, and we select the one with the highest confidence

here

if the confidence of the response generation is high, we just send the

response back to the user

if it is low,

what are we gonna do?

sorry,

i apologise, the slides are not following along properly

okay,

maybe now it's okay

okay

so here, if the generation confidence

score is low, we go over some lexical-semantic strategies,

and then finally we'll talk about the other set

we designed a set of rule-based strategies; for example, if the user repeats themselves,

we say 'you already said that',

and if the user is replying with a single word, we just react to

that, saying something like 'you said something that's an incomplete sentence'

we also have grounding strategies,

for example grounding on named entities:

basically we detect the named entity, try to find it in a

knowledge base, and use a template to ground it

so for example, 'do you like clinton?' 'which clinton are you talking about, bill clinton the

former president of the united states,

or hillary clinton,

the democratic candidate?'

we also have grounding on out-of-vocabulary words; for example, we detect an out-of-vocabulary

word, then use a template to generate the sentence, and at the same time we update

the vocabulary as well

so for example the user says

'you are very confrontational',

and the system asks 'what do you mean by confrontational'
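a sketch of those two grounding strategies, with a made-up knowledge base, vocabulary, and templates; a real system would use an actual named-entity recognizer and knowledge base

```python
# sketch of the two grounding strategies with made-up templates and a toy
# knowledge base; a real system would use an NER model and a proper KB
knowledge_base = {
    "clinton": ["bill clinton, the former u.s. president",
                "hillary clinton, the democratic candidate"],
}
vocabulary = {"you", "are", "very", "what", "do", "mean", "by"}

def ground(user_utterance):
    tokens = user_utterance.lower().replace("?", "").split()
    # named-entity grounding: ask which sense of an ambiguous entity is meant
    for t in tokens:
        if t in knowledge_base and len(knowledge_base[t]) > 1:
            options = " or ".join(knowledge_base[t])
            return f"which {t} are you talking about, {options}?"
    # out-of-vocabulary grounding: ask about the unknown word, then learn it
    for t in tokens:
        if t not in vocabulary:
            vocabulary.add(t)
            return f"what do you mean by {t}?"
    return None

print(ground("do you like clinton?"))
print(ground("you are very confrontational"))
```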

then we ran a lot of user queries to try to evaluate how these strategies

are doing, based on human annotation of appropriateness

we can see that mostly people think they are appropriate, but there are some problems:

for example, if the detected named entity is wrong,

then the

generated response will not be correct

and for the out-of-vocabulary strategy, if the user

is just using a more casual way of spelling and the system

tries to ground on that word, the user finds it inappropriate

these strategies need certain conditions to hold in order to trigger, so if

none of the conditions triggers, we actually go into our engagement or appropriateness strategies,

to actively try to bring the user into the conversation

so we looked into the previous literature

basically we find that in the communication literature, active participation is really important,

as is positive feedback or encouragement, so we mainly implemented a set of strategies that

go with active participation

so whenever we start a conversation, we usually pick a topic to

initiate with the user

and then we design each strategy with respect to the topic, so

the system can stay on the topic or change the topic; if

we try to stay on the topic, we could tell jokes, like 'did

you know that people usually spend far more time watching sports than actually playing them',

initiate an activity, for example 'do you want to watch a

game together sometime',

or talk more, 'let's talk more about sports'

we can also change the topic,

for example 'how about we talk about' something else,

end the topic with an open question, say 'that's interesting', or share with the user some interesting

news from the internet
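a sketch of these engagement strategies as topic-parameterised templates; the wording is paraphrased from the talk, not the exact system prompts

```python
# sketch of the engagement strategies as topic-parameterised templates;
# the wording is paraphrased from the talk, not the exact system prompts
STAY_ON_TOPIC = {
    "joke":      "did you know people spend far more time watching {t} than playing it?",
    "initiate":  "do you want to watch a {t} game together sometime?",
    "tell_more": "let's talk more about {t}.",
}
SWITCH_TOPIC = {
    "switch":        "how about we talk about {new_t}?",
    "open_question": "that's interesting. what else do you enjoy?",
    "share_news":    "i saw some interesting news about {new_t} on the internet.",
}

def realise(strategy, topic=None, new_topic=None):
    if strategy in STAY_ON_TOPIC:
        return STAY_ON_TOPIC[strategy].format(t=topic)
    return SWITCH_TOPIC[strategy].format(new_t=new_topic)

print(realise("initiate", topic="basketball"))
print(realise("switch", new_topic="music"))
```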

so we also evaluated the appropriateness of these strategies based on the

users' ratings; here we only used a random selection policy, which means that whenever

the generation confidence is not good enough

and none of the lexical-semantic strategies is triggered,

we randomly select one of these engagement strategies

and we do find some of them are doing pretty well,

for example activity initiation and 'tell more',

but some of them are actually doing pretty badly, for example telling

jokes

so without the context, these strategies can go very wrong

so here is one of the examples

i apologise again

here is a failure case: the system says something like 'a lot of people

really like politics, let's talk about politics', and the user says 'no, i

don't like politics'; the system asks 'why is that', and the user says 'i just don't like politics'

and then the system goes into the activity initiation strategy, 'how about we

watch it together sometime', and the user says 'i told you i don't want to talk about

politics'

basically we find that a lot of the inappropriateness comes from

selecting the strategy without taking the context

into consideration; if we look closely at the semantic context, we find that the user

is expressing negative sentiment in a row, and at this point

the correct move is to

pick a strategy that switches the topic,

which actually can

handle the situation when the user is not happy about the current topic

so we say that we need to model the context in the strategy selection

basically we have two goals: we want to avoid

inappropriateness,

and we use reinforcement learning to do the global planning. so we take some

state variables, which capture the uncertainty and some of the variables we

mentioned before,

for example the system appropriateness confidence,

the previous utterance sentiment confidence, the number of times each strategy has been executed, the turn

position, and the most recently used strategy; we take all of these into consideration in training our

reinforcement learning policy
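a schematic sketch of what such a policy could look like, with a discretised state and a tabular Q-learning update; the bucketing, strategy set, and hyper-parameters are illustrative, not the ones used in the actual system

```python
# schematic sketch: a discretised dialogue state plus a tabular Q-learning
# update; the bucketing, strategies, and hyper-parameters are illustrative
from collections import defaultdict
import random

STRATEGIES = ["joke", "initiate", "tell_more", "switch", "open_question", "share_news"]
Q = defaultdict(float)                       # (state, action) -> value
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def make_state(appropriateness_conf, sentiment, turn_pos, last_strategy):
    # bucket continuous confidences/sentiments so the table stays small
    return (round(appropriateness_conf, 1), round(sentiment, 1),
            min(turn_pos, 10), last_strategy)

def choose(state):
    if random.random() < epsilon:            # explore
        return random.choice(STRATEGIES)
    return max(STRATEGIES, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in STRATEGIES)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

s = make_state(0.42, -0.3, 4, "joke")
a = choose(s)
update(s, a, reward=1.0, next_state=make_state(0.61, 0.1, 5, a))
```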

we use another chatbot as a simulated user to train the conversation

policy

so we have a reward function

which is a combination of response appropriateness, conversational depth, and information gain

appropriateness we already defined,

and we train a binary classifier based on the human appropriateness labels,

so this automatic predictor is used in the reinforcement learning training process

for conversational depth, we count the consecutive utterances

in a row that keep on the same topic, and we also train

another automatic predictor based on the human annotation

and finally the last one is the information gain, which

accounts for the variety of the conversation:

we just count the number of unique words that the user

and the system have spoken
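putting those three terms together, here is a minimal sketch of such a combined reward; the weights w1, w2, w3 are placeholders, since the talk only says the weights were set empirically

```python
# sketch of the combined reward: appropriateness (from a trained classifier),
# conversational depth (consecutive turns on one topic), and information gain
# (new unique words); the weights w1..w3 are placeholders, since the talk
# says they were set empirically
def reward(appropriateness_prob, same_topic_turns, seen_words, new_utterance,
           w1=1.0, w2=0.5, w3=0.1):
    tokens = set(new_utterance.lower().split())
    info_gain = len(tokens - seen_words)     # words not seen before in this dialogue
    seen_words.update(tokens)
    return w1 * appropriateness_prob + w2 * same_topic_turns + w3 * info_gain

seen = set()
print(reward(0.8, 3, seen, "let's talk more about baseball"))
```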

so in the end we have weights that we

empirically decided on to combine the terms of the reward

function; we think that later we are probably going to use a machine learning

approach to learn the weights

so

we have another two policies that we compare our reinforcement learning policy against: the first

is the random selection policy,

the other is a locally greedy policy, which uses the sentiment of the previous three utterances

to decide on a strategy

for example,

if the user is positive in a row, we can

talk more about this topic,

and if the sentiment is negative, the policy switches the topic
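a sketch of that locally greedy baseline; the word-list sentiment scorer is only a trivial stand-in for a real sentiment model

```python
# sketch of the locally greedy baseline: look at the sentiment of the last
# three user utterances and stay on or switch the topic accordingly;
# the sentiment scorer here is a trivial word-list stand-in
POSITIVE, NEGATIVE = {"like", "love", "great", "sure"}, {"no", "don't", "hate", "boring"}

def sentiment(utterance):
    words = set(utterance.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def greedy_policy(last_three_user_utterances):
    score = sum(sentiment(u) for u in last_three_user_utterances)
    return "tell_more" if score > 0 else "switch"

print(greedy_policy(["no i don't like politics", "why are you asking", "i hate this topic"]))
```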

so in the end, training with the simulator using the

reinforcement learning policy and testing

with real humans interacting with the system,

we decrease the inappropriateness

and we increase the conversational depth and the total information gain

so in conclusion, we think the conversational strategies we designed,

the lexical-semantic strategies, are useful,

considering the conversational history is useful,

and integrating the uncertainty of different upstream ml models into the reinforcement

learning is useful

any questions

okay

yes, so that's a good question. so basically we do have

different surface forms when designing these

strategies

this is actually our future work: we want to see how we can

generate sentences with pragmatics inside them

right now it's based on templates

so basically we tried to use different wordings,

but it is still templates, not really very general

generation

and that's a good question. so here the idea is that

we're trying to integrate as much of

the uncertainty of the conversation into the dialogue planning as possible; definitely things like the word2vec

similarity are also extra information that could go into the strategy selection,

or, say, for a spoken dialogue system, the asr error

so i think definitely, if you can optimize while considering all these uncertainties inside

the dialogue system, it would be better

but we haven't done that yet

it would be too many states

basically,

the state space expands exponentially if you have more variables

in it

any other questions

o

and that's a good question. so basically we ask the user, with respect to the user's utterance, whether they think

the response is appropriate and coherent or not

so sometimes, if the topic change comes at the right time,

people think it's appropriate,

and if it's not,

they would think it's inappropriate

so we give them a pretty broad interpretation of what appropriate means

so a lot of people do take context into consideration when they're rating

them

pretty much, right. so that's why, in the reward function, we

try to account for the variety as well

in the optimisation function. so basically,

appropriateness is like one aspect of making the system conversational,

and others, like being funny or being provocative or anything else,

could be added on top of that

so i think these are like different building blocks,

and variety or personalisation is something that could be considered