Speech Transcript - Strategy and Policy Learning for Non-Task-Oriented Conversational Systems

0:00:15	So I'm Zhou Yu from Carnegie Mellon. This is collaborative work with Ziyu Xu, my
0:00:21	intern,
0:00:21	and my advisors Alan Black and Alex Rudnicky, over there.
0:00:25	Today I'm gonna talk about strategy and policy learning for non-task-oriented conversational systems.
0:00:30	So as we all know, non-task-oriented conversational systems are what people call
0:00:36	chatbots or social chatbots.
0:00:38	So the task is what we call social chatting, and people always ask me,
0:00:43	why do we need social chatting?
0:00:47	So the motivation is simple, actually.
0:00:49	If we look at human conversations, we actually use a lot of social chatting
0:00:54	in our conversations. When you're meeting someone for a certain task, you actually try to do
0:01:00	some social chatting to ease into the conversation, like talking about your weekend before
0:01:05	you get into the meeting agenda.
0:01:08	Also, social chatting is itself a certain type of conversation, whose purpose is social ties
0:01:14	with your coworkers or with your friends. Of course it also has other application fields, like
0:01:19	education:
0:01:20	you want a tutor to be socially intelligent, to be able to use this
0:01:24	kind of social chatting to interleave the conversation.
0:01:28	Think of health care,
0:01:30	or language learning. We see that for the complex tasks in these areas,
0:01:36	social chatting is essential.
0:01:39	So we wanna design a system that is able to perform social chatting,
0:01:44	and we have some goals in mind. One is for the system
0:01:48	to be appropriate.
0:01:50	We want the system to be able to go into depth in the conversation.
0:01:54	We want the system to provide a variety of answers suited to different users.
0:02:00	there
0:02:03	Well, we wanna say the main goal is to make sure the system is coherent and
0:02:06	appropriate at the single-turn level.
0:02:10	So we define this appropriateness as the coherence of the response with the
0:02:14	user utterance, and we have three labels: inappropriate, interpretable, and appropriate.
0:02:21	Later we're gonna use these labels to evaluate our systems.
0:02:25	First of all, we need a lot of data in order to evaluate the system. At
0:02:30	the same time, we also wanted to have a fairly easy pipeline to actually do
0:02:35	the evaluation.
0:02:36	People who have been working on dialogue systems know that it's hard to get
0:02:40	data.
0:02:40	are you one
0:02:42	kristen is one single they don't like
0:02:44	And for user evaluation, you have to have a user interact with the system; it's
0:02:49	also very expensive.
0:02:51	So here, in order to expedite the process, we put the chatbot behind a text
0:02:56	API, so people can access the chatbot in a web browser.
0:03:02	We can have multiple people talk to it at the same time; it's multi-threaded.
0:03:06	And we also automatically connect the user to a rating task after the
0:03:12	conversation, so they can rate whether a certain response is appropriate or not. We give them
0:03:17	the whole dialogue history to review.
0:03:20	We also made it open source, both the data and the code,
0:03:25	so you can get it from GitHub.
0:03:27	We also have demos that run on Amazon Mechanical Turk, on a
0:03:32	machine which, sorry, that runs
0:03:35	twenty-four hours, seven days a week. And so if we go over, so we're
0:03:40	just gonna demo it a little bit. So here is the
0:03:44	user screen, and you type in something. For example, the chatbot says
0:03:48	"I like music too. How about we talk about music?"
0:03:50	"Sure."
0:03:52	"What do you want,
0:03:54	what do you want to talk about?"
0:03:59	It's almost everything. And also the interaction is very easy; it's a
0:04:04	very nice way to motivate the user to interact with the system,
0:04:10	and it is also a very easy way to evaluate data. So we sometimes post it on
0:04:14	Mechanical Turk or social networks to actually get more users.
0:04:25	there
0:04:26	Let's take a step back to look at the previous work on task-oriented
0:04:31	systems.
0:04:31	So we are usually familiar with this architecture: once we get the user input,
0:04:36	we do language understanding, then we go to a dialog manager to decide what to
0:04:41	generate, and in the end we have the system output.
0:04:45	A lot of work has been done on what to do if there is some non-understanding
0:04:48	happening in the system, so something the user says that is not
0:04:53	comprehensible to the system.
0:04:55	A lot of people have designed conversational strategies to handle these errors. For
0:05:00	example, we say "Can you say that again?" or "Do you mean ...?" We are very familiar with
0:05:04	these conversational strategies.
0:05:07	And moreover,
0:05:09	there is a lot of work, and
0:05:11	a lot of people at Cambridge and CMU have been working on
0:05:16	using a POMDP or an MDP to optimize the process of
0:05:21	choosing which strategy to use, that is, planning globally to optimize the
0:05:26	task completion rate. So
0:05:29	that is the previous work on task-oriented systems. Can we do that
0:05:32	on non-task-oriented systems?
0:05:35	So the research questions
0:05:36	are: can we develop conversational strategies to handle,
0:05:41	for example, we really care about appropriateness, so can we handle
0:05:45	this inappropriateness in non-task-oriented systems?
0:05:48	And can we actually use this kind of globally planned policy to actually regulate the
0:05:53	conversation? For instance, which I think
0:06:07	re
0:06:09	you
0:06:33	i apologise for their pipeline
0:06:35	question
0:07:02	I really apologise for the
0:07:03	disturbance.
0:07:05	So we are trying to see whether we can design
0:07:09	conversational strategies and conversational policies
0:07:12	to help non-task-oriented systems be more appropriate.
0:07:16	So here we designed an architecture which is very similar to a
0:07:20	task-oriented system.
0:07:21	First of all, once we get the user input, we try
0:07:26	to use some context tracking strategies that we developed,
0:07:29	and then we generate a response.
0:07:32	And then, if
0:07:37	the system has high confidence that the response is a good one,
0:07:40	then we just
0:07:42	send the system response back to the user.
0:07:45	If the system is not confident that it's a good response,
0:07:49	then we go into some lexical-semantic strategies that
0:07:54	we introduce later to deal with the low confidence. If that works, we
0:08:00	just use those methods to generate the output. If
0:08:05	none of the conditions trigger these strategies, then we go into our engagement and
0:08:09	appropriateness strategies to actually generate the output; we have five.
0:08:15	In yesterday's poster we also talked about another system which
0:08:20	is similar to this one, where we also take engagement into
0:08:24	consideration in the whole
0:08:26	process.
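
The three-stage fallback described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the callable signatures, and the threshold value `CONFIDENCE_THRESHOLD` are all assumptions, since the talk does not give them.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical cut-off; the talk does not state the value that was used.
CONFIDENCE_THRESHOLD = 0.5

# A generator returns (response, confidence) for a user utterance.
Generator = Callable[[str], Tuple[str, float]]
# A lexical-semantic strategy returns a reply if its condition triggers, else None.
LexicalStrategy = Callable[[str, List[str]], Optional[str]]
# An engagement strategy always produces a reply from the dialogue history.
EngagementStrategy = Callable[[List[str]], str]


def respond(
    user_utterance: str,
    history: List[str],
    generate: Generator,
    lexical_strategies: List[LexicalStrategy],
    engagement_strategy: EngagementStrategy,
) -> str:
    """Pick a reply using the cascade: generation, lexical-semantic, engagement."""
    reply, confidence = generate(user_utterance)
    if confidence >= CONFIDENCE_THRESHOLD:
        # Stage 1: the generated response is trusted and sent back directly.
        return reply
    for strategy in lexical_strategies:
        # Stage 2: try each lexical-semantic strategy until one triggers.
        fallback = strategy(user_utterance, history)
        if fallback is not None:
            return fallback
    # Stage 3: nothing triggered, so fall back to an engagement strategy.
    return engagement_strategy(history)
```

Keeping each stage behind a plain callable means the policy that chooses the engagement strategy can be swapped (random, greedy, or learned) without touching the cascade itself.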
0:08:27	So, as we talked about,
0:08:29	we have three sets of strategies that we're gonna talk about in detail later, about
0:08:33	how we can make the system more appropriate, and we have
0:08:37	also a policy that
0:08:39	actually chooses between different strategies to make the whole process better,
0:08:46	in order
0:08:47	to optimize the whole process globally.
0:08:50	So we have two components we're gonna talk about: the response generation
0:08:55	side and the conversational strategy selection side.
0:08:58	First of all, how do we track context? So
0:09:02	first we have anaphora resolution, which is mainly
0:09:07	pronoun resolution,
0:09:09	because we wanted to make a strategy that solves ninety percent of the cases.
0:09:15	So for example, "Do you like Taylor Swift?"
0:09:17	We detect "Taylor Swift",
0:09:19	and
0:09:20	if the user says "Yes, I like her a lot", we replace "her" with "Taylor Swift"
0:09:25	here, for the next response generation.
0:09:29	We also do response ranking with history similarity.
0:09:32	Basically we use word2vec to rank the similarity between the candidates and the
0:09:37	previous utterances.
0:09:41	For example, TickTock says "I watch a lot of
0:09:44	baseball games",
0:09:46	and the user says "What do you like most?"
0:09:48	So here we have two candidates.
0:09:51	One is "I like Taylor Swift";
0:09:52	the other is a reply about baseball. So here, if we do
0:09:56	the word2vec similarity test, we will know that
0:10:00	the second one is preferred, because they are more on the same page in terms of
0:10:05	semantics.
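
The ranking step above can be sketched with averaged word vectors and cosine similarity. This is an illustration under stated assumptions: `TOY_VECTORS` is a hand-made, hypothetical stand-in for trained word2vec embeddings, and averaging word vectors into a sentence vector is one common choice, not necessarily the one used in the system.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]

# Hypothetical 2-dimensional toy vectors; a real system would load trained
# word2vec embeddings instead.
TOY_VECTORS: Dict[str, Vector] = {
    "baseball": (0.9, 0.1),
    "game": (0.8, 0.2),
    "team": (0.85, 0.15),
    "music": (0.1, 0.9),
    "singer": (0.15, 0.85),
}


def sentence_vector(text: str, vectors: Dict[str, Vector]) -> List[float]:
    """Average the vectors of the known words; unknown words are ignored."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_candidates(
    history: str, candidates: List[str], vectors: Dict[str, Vector]
) -> List[str]:
    """Order candidate responses from most to least similar to the history."""
    history_vec = sentence_vector(history, vectors)
    return sorted(
        candidates,
        key=lambda c: cosine(history_vec, sentence_vector(c, vectors)),
        reverse=True,
    )
```

With the toy vectors, a history about a baseball game ranks a candidate mentioning a team above one mentioning a singer, mirroring the example in the talk.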
0:10:06	Then we go into our response generation methods.
0:10:09	So after we consider the context and history, then we do the
0:10:14	actual generation. We have two methods that we actually use
0:10:18	and select between based on the confidence.
0:10:20	One is the keyword retrieval method.
0:10:23	Basically, we
0:10:25	find the keywords in the
0:10:29	user's response and match them in the database,
0:10:32	and we return the corresponding response that has the highest weight,
0:10:36	aggregated weight.
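
A minimal sketch of keyword retrieval with aggregated weights follows. The database layout, the weights, and the stored responses in `DATABASE` are hypothetical; the talk only says that keywords from the user's utterance are matched against a database and the response with the highest aggregated weight is returned.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy database: each stored response is indexed by weighted keywords.
DATABASE: List[Tuple[Dict[str, float], str]] = [
    ({"music": 1.0, "listen": 0.5}, "What kind of music do you listen to?"),
    ({"baseball": 1.0, "game": 0.5}, "Did you watch the game last night?"),
]


def retrieve(
    user_utterance: str, database: List[Tuple[Dict[str, float], str]]
) -> Tuple[Optional[str], float]:
    """Return the response with the highest aggregated keyword weight.

    The aggregated weight is returned alongside the response so that it can
    serve as a confidence score; (None, 0.0) means no keyword matched.
    """
    tokens = set(user_utterance.lower().split())
    best_response: Optional[str] = None
    best_score = 0.0
    for keywords, response in database:
        # Sum the weights of every keyword that appears in the utterance.
        score = sum(weight for word, weight in keywords.items() if word in tokens)
        if score > best_score:
            best_response, best_score = response, score
    return best_response, best_score
```

Returning the score together with the response is a design choice made here so that a later step can compare it against the confidence of another generation method.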
0:10:39	We use a database built from existing interview transcripts, a dataset from CNN.
0:10:43	We also collected our own dataset using MTurk.
0:10:48	The other method is a skip-thought neural network
0:10:52	model.
0:10:52	So basically we are using an encoder and a decoder to generate the response.
0:10:59	we all concept i don't is on sixteen in this matter
0:11:02	Basically, we have two
0:11:03	methods, and we select the one with the highest confidence.
0:11:11	here
0:11:12	If the confidence is high in the response generation model, we just send the
0:11:16	response back to the user.
0:11:18	If it is low,
0:11:19	what we're gonna do is
0:11:22	right
0:11:45	re
0:11:53	apologise for the expected being the
0:11:55	or point following when greatly
0:12:05	like
0:12:06	i know how well
0:12:18	right
0:12:39	maybe you are okay
0:12:41	okay
0:13:05	So here we go over some lexical-semantic strategies, used if the generation confidence
0:13:10	score is low.
0:13:12	Then finally we'll talk about the other ones.
0:13:14	there
0:13:22	We designed a set of strategies. For example, if the user repeats themselves,
0:13:27	we say "You already said that."
0:13:29	And if the user is replying with a single word, we just react to
0:13:35	that by asking them to say something in a complete sentence.
0:13:38	We also have grounding strategies:
0:13:42	grounding on named entities.
0:13:44	So basically we detect the named entity, try to find it in the database,
0:13:48	the knowledge base, and try to use a template to fill in.
0:13:52	So for example, "Do you like Clinton?" "Which Clinton are you talking about: Bill Clinton, the
0:13:57	former United States president,
0:13:58	or Hillary Clinton,
0:14:00	the Democratic candidate?"
0:14:02	We also ground on out-of-vocabulary words. So for example, we detect that there are
0:14:07	out-of-vocabulary words, then use a template to generate the sentence, and at the same time we update
0:14:11	the vocabulary as well.
0:14:13	So for example, the user says
0:14:15	"You're very confrontational", and TickTock says
0:14:17	"What do you mean by confrontational?"
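
Three of the strategies above (reacting to repetition, reacting to a single-word reply, and grounding on out-of-vocabulary words) can be sketched as simple trigger functions. This is a hedged illustration: the function names, the tokenization, the order in which strategies are tried, and the wording of the single-word prompt are assumptions. Grounding on named entities is left out because it needs a knowledge base.

```python
from typing import List, Optional, Set


def tokenize(utterance: str) -> List[str]:
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(".,!?\"'") for w in utterance.lower().split())
    return [w for w in words if w]


def react_to_repetition(utterance: str, user_history: List[str]) -> Optional[str]:
    """Trigger when the user repeats one of their own earlier utterances."""
    seen = {" ".join(tokenize(u)) for u in user_history}
    if " ".join(tokenize(utterance)) in seen:
        return "You already said that."
    return None


def react_to_single_word(utterance: str) -> Optional[str]:
    """Trigger when the user replies with a single word."""
    if len(tokenize(utterance)) == 1:
        # Hypothetical wording; the talk only paraphrases this prompt.
        return "Can you say that in a complete sentence?"
    return None


def ground_on_oov(utterance: str, vocabulary: Set[str]) -> Optional[str]:
    """Ask about the first unknown word, then add it to the vocabulary."""
    for word in tokenize(utterance):
        if word not in vocabulary:
            vocabulary.add(word)
            return f"What do you mean by {word}?"
    return None


def lexical_semantic_reply(
    utterance: str, user_history: List[str], vocabulary: Set[str]
) -> Optional[str]:
    """Try each strategy in turn; None means no condition triggered."""
    return (
        react_to_repetition(utterance, user_history)
        or react_to_single_word(utterance)
        or ground_on_oov(utterance, vocabulary)
    )
```

Returning `None` when nothing triggers lets the caller fall through to the engagement strategies.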
0:14:20	We had a lot of Turkers try to evaluate how these strategies
0:14:24	are doing, based on human annotation of appropriateness.
0:14:29	We can see that mostly people think they are appropriate, but there are some problems.
0:14:33	For example, if the named entity detection is wrong,
0:14:36	then the
0:14:37	generated response will not be correct.
0:14:40	For example, we also have the out-of-vocabulary words: if the user
0:14:44	is using some more casual way of spelling and the chatbot
0:14:50	tries to confirm it, then the user will find it inappropriate.
0:15:02	These strategies have to have existing conditions to trigger them. So if
0:15:07	none of the conditions trigger, we actually go into our engagement and appropriateness strategies
0:15:12	to actively try to bring the user into the conversation.
0:15:16	So we looked into the previous literature.
0:15:19	Basically we find that in the communication literature, active participation is really important,
0:15:28	and also things like positive feedback or encouragement. We mainly implemented a set of strategies that
0:15:34	goes with the active participation strategy.
0:15:37	So whenever we start a conversation, we usually pick a topic to
0:15:42	initiate with the user,
0:15:44	and then we design each strategy with respect to the topic. So
0:15:50	we have two options: you can stay on the topic or change the topic. So
0:15:53	if we try to stay on the topic, we could tell a joke: "Do
0:15:57	you know that people usually spend far more time watching sports than actually playing any?"
0:16:03	Or initiate an activity, for example "Do you want to see a
0:16:05	game together sometime?"
0:16:07	Or talk more: "Let's talk more about sports."
0:16:11	You can also change the topic,
0:16:12	for example "How about we talk about" some other topic,
0:16:15	or end the topic with an open question: "That's interesting. Could you share with me some interesting
0:16:20	news on the internet?"
0:16:22	So basically we also evaluated the appropriateness of these strategies based on the
0:16:28	users' ratings. Here we only use a random selection policy, which means that
0:16:32	whenever we find that
0:16:34	the generation does not have high confidence
0:16:38	and none of the lexical-semantic strategies are triggered,
0:16:41	we go over to these and randomly select one of these strategies.
0:16:47	And we do find some of them are doing pretty well,
0:16:50	for example activity initiation and telling more.
0:16:54	But some of them are actually doing pretty badly, for example jokes. So mainly we
0:16:58	find
0:16:59	that without the context, these strategies can go badly wrong.
0:17:03	So here is one of the examples.
0:19:15	Apologies again.
0:19:17	I'll try to make up the time. Here is the failure case we can see. So
0:19:24	TickTock says "Hello, I really like politics. Let's talk about politics", and the user says "No, I
0:19:29	don't like politics." TickTock: "Why is that?" And the user: "I just don't like politics."
0:19:33	And then the system goes into the initiate-activity strategy: "Why don't we
0:19:38	watch a film together sometime?" "I told you, I don't want to talk about
0:19:42	politics."
0:19:43	Basically, we find there is inappropriateness
0:19:47	whenever we select the strategy without taking the context
0:19:52	into consideration. If we look closely at the semantic context, we find that the user
0:19:57	is expressing negative sentiment in a row, and at this time
0:20:02	the correct way is to
0:20:03	pick a strategy such as switch topic, which
0:20:06	actually can
0:20:08	handle the situation when the user is not happy about the current topic.
0:20:13	So we see that we need to model the context in the strategy selection.
0:20:19	there
0:20:20	Basically, we have to do some work; we want to avoid this
0:20:23	inappropriateness.
0:20:25	Then we use reinforcement learning to do the global planning. We take some
0:20:28	state variables, which carry uncertainty and which are some of the variables we
0:20:34	mentioned before:
0:20:35	for example, the system appropriateness confidence,
0:20:38	the previous utterance sentiment confidence, the number of times each strategy was executed, the turn
0:20:45	position, and the most recently used strategy. We take all these into consideration in training our
0:20:51	reinforcement learning policy.
0:20:54	We use another chatbot as a simulator to train the
0:20:59	conversation policy.
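
The state variables listed above could be held in a small record like the one below. This is only a sketch: the class name `DialogState`, the field names and types, and the `record` helper are assumptions made for illustration, and the talk does not say how the state was encoded for learning.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DialogState:
    """Hypothetical container for the state variables named in the talk."""

    # Confidence that the system's last response was appropriate.
    appropriateness_confidence: float
    # Confidence of the sentiment predicted for the previous utterance.
    sentiment_confidence: float
    # How many times each strategy has been executed so far.
    strategy_counts: Dict[str, int] = field(default_factory=dict)
    # Index of the current turn in the conversation.
    turn_position: int = 0
    # The most recently used strategy, if any.
    last_strategy: Optional[str] = None

    def record(self, strategy: str) -> None:
        """Update the counters after a strategy is executed in one turn."""
        self.strategy_counts[strategy] = self.strategy_counts.get(strategy, 0) + 1
        self.last_strategy = strategy
        self.turn_position += 1
```

Note that continuous confidences would need to be discretized before a tabular method could use them, and each added variable multiplies the number of states.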
0:21:00	So we have a reward function,
0:21:02	which is a combination of response appropriateness, conversational depth, and information gain.
0:21:09	Appropriateness we already defined,
0:21:11	and we train a binary classifier based on the human-annotated labels.
0:21:17	So this automatic predictor is gonna be used in the reinforcement learning training process.
0:21:23	And also conversational depth: we define conversational depth as the number of consecutive utterances
0:21:28	in a row that keep on the same topic. We also train
0:21:34	another automatic predictor based on human annotation.
0:21:39	And finally we have the last one, which is information gain, which
0:21:43	accounts for the variety of the conversation.
0:21:45	So we just take the number of unique words that both the user
0:21:49	and the system have spoken.
0:21:52	So in the end we have weights; we
0:21:54	empirically decided the weights for the reward
0:21:58	function, which we think later we're gonna be using a machine learning
0:22:03	method to train.
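
The reward described above can be sketched as a weighted sum of the three terms. This is an assumption-laden illustration: the talk says the reward is a "combination" with empirically chosen weights, so the linear form and the default weight values here are hypothetical, and the appropriateness and depth scores are taken as inputs because in the talk they come from trained predictors.

```python
from typing import List, Sequence


def information_gain(utterances: List[str]) -> int:
    """Count the unique words spoken by the user and the system together."""
    return len({word for u in utterances for word in u.lower().split()})


def reward(
    appropriateness: float,
    conversational_depth: float,
    utterances: List[str],
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical weights
) -> float:
    """Weighted sum of appropriateness, conversational depth, and information gain.

    `appropriateness` and `conversational_depth` stand in for the outputs of
    the automatic predictors mentioned in the talk.
    """
    w_app, w_depth, w_info = weights
    return (
        w_app * appropriateness
        + w_depth * conversational_depth
        + w_info * information_gain(utterances)
    )
```

Because unique-word counts grow much faster than a 0-to-1 appropriateness score, the weights also act as a scale correction between the terms.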
0:22:05	So
0:22:06	we have another two policies that we compare our reinforcement learning policy against. The first
0:22:10	is the random selection policy;
0:22:12	the other is a local greedy policy, which is based on the previous three utterances'
0:22:17	sentiment to decide a strategy.
0:22:19	For example,
0:22:19	if the user is positive in a row, we can say "Can we talk more about
0:22:23	this topic?"
0:22:24	If it's negative, we switch the topic.
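
A minimal sketch of the local greedy baseline follows. The strategy labels, the string sentiment labels, and the `default` fallback for mixed or neutral sentiment are assumptions; the talk only specifies the positive and negative cases over the previous three utterances.

```python
from typing import List


def local_greedy_policy(
    recent_sentiments: List[str], default: str = "end_with_open_question"
) -> str:
    """Choose a strategy from the sentiment of the last three user utterances.

    `recent_sentiments` holds labels such as "positive", "negative", or
    "neutral", oldest first. `default` is a hypothetical fallback for the
    cases the talk does not cover.
    """
    window = recent_sentiments[-3:]
    if len(window) == 3 and all(s == "positive" for s in window):
        # The user has been positive in a row: stay on the topic.
        return "talk_more"
    if len(window) == 3 and all(s == "negative" for s in window):
        # The user has been negative in a row: move to another topic.
        return "switch_topic"
    return default
```

Unlike the reinforcement learning policy, this rule looks only at the most recent sentiment and ignores how often each strategy has already been used.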
0:22:28	So in the end we find that, both in the simulated training setting, where we are using the
0:22:31	reinforcement-learning-trained policy, and in testing
0:22:35	with real humans interacting with the system,
0:22:37	we decrease the inappropriateness,
0:22:40	we increase the conversational depth, and we increase the total information gain.
0:22:45	So the conclusion: we think the conversational strategies we designed,
0:22:50	the lexical-semantic strategies, are useful,
0:22:54	and considering conversational history is useful,
0:22:57	and integrating the uncertainties of different upstream ML models in the reinforcement
0:23:02	learning is useful.
0:23:06	Any questions?
0:23:08	Okay.
0:23:31	Yes, so that's a good question. So basically we do have
0:23:37	different surface forms in designing this kind of
0:23:40	strategy.
0:23:41	This is actually our future work: we want to see how we can
0:23:46	generate sentences with pragmatics inside of them.
0:23:48	Right now it's based on some templates.
0:23:52	So basically we tried to use different wordings, but
0:23:56	it is still templates, not really very general.
0:24:02	jury
0:24:18	That's a good question. So here the idea is that
0:24:22	we are trying to integrate as much as possible of
0:24:25	the uncertainty of the conversation into the dialogue planning. Definitely, all this kind of
0:24:30	word2vec
0:24:32	is also extra information that can go into the strategy selection,
0:24:36	or, if it's a spoken dialogue system, ASR error is one.
0:24:41	So I think definitely, if you can optimize considering all these uncertainties inside
0:24:46	the dialogue system, it would be better.
0:24:48	But we haven't done that yet.
0:24:52	Too many states.
0:24:55	Basically,
0:24:56	it's like an explosion: the space will expand exponentially if you add more variables
0:25:03	in there.
0:25:08	Any other questions?
0:25:30	o
0:25:30	That's a good question. So basically we ask the user, so we just
0:25:35	ask: with respect to the user's utterance, do you think
0:25:40	the response is appropriate, coherent or not?
0:25:43	So sometimes, if the topic change is kind of right on time,
0:25:48	they think it's appropriate.
0:25:50	If it's not,
0:25:51	they would think it's inappropriate.
0:25:54	So we give them a pretty broad interpretation of what appropriate is,
0:25:59	so a lot of people do take context into consideration when they're rating them.
0:26:08	true
0:26:09	true
0:26:25	Right, right. So that's why in the reward function we
0:26:30	try to account for the variety as well,
0:26:34	in the optimisation function. So basically,
0:26:38	appropriateness is just one aspect of making the system communicable,
0:26:44	and others, like being funny or being provocative or anything else,
0:26:48	could be added on top of that.
0:26:50	So I think those are different aspects,
0:26:52	and variety or personalisation are things that could be considered.
0:27:05	i

Strategy and Policy Learning for Non-Task-Oriented Conversational Systems

Oral Session 7: Non-task-oriented dialogue systems

Zhou Yu, Ziyu Xu, Alan W Black and Alexander Rudnicky