0:00:15 Right, for the last talk of the session I'll try to keep it fun; there are a lot of videos, so hopefully it stays engaging. This work is on using reinforcement learning for modeling incrementality in the context of a fast-paced dialogue game.
0:00:38 This is joint work with my advisers, David DeVault and Kallirroi Georgila.
0:00:44 So, incrementality is what this work is focused on. Human speech processing is incremental: we process content word by word, and sometimes even sub-word, trying to process it as soon as it becomes available. Incrementality helps us model different natural dialogue phenomena such as rapid turn-taking, speech overlaps, barge-ins, and backchannels, so modeling these things is very important for making dialogue systems more natural and efficient.
0:01:18 The contributions of this work can be grouped into three points. The first is that we provide a reinforcement learning method to model incrementality. The second is that we provide a detailed analysis of what the learned policy does. In our previous work we built a state-of-the-art, carefully designed rule-based baseline system which interacts with humans in real time and performs nearly as well as humans, so it's a really strong baseline; you'll see videos and get more context in the slides to come. The reinforcement learning model introduced here actually outperforms this carefully-designed-rules (CDR) baseline, but please keep in mind the evaluation is offline: we don't have it in a real-time system yet. Third, we also provide some analysis of the development time it took to develop each approach.
0:02:13 The selected domain is a rapid dialogue game we call RDG-Image. It's a two-player, collaborative image-matching game. Each person is assigned a role, either the director or the matcher. The director sees the eight images you see on the screen, one of which is highlighted with a red border, and is supposed to describe that one. The matcher sees the same eight images in a different order and is supposed to make a selection based on the description given. The goal is to get as many matches as possible in a limited amount of time, so the dialogue is fast and incremental.
0:02:55 Let's look at an example of how this game works. Here you see two human players playing with one another. The person on the top is the director: they see one of the images highlighted with a red border and describe that highlighted image. The person below is the matcher, who tries to guess the image based on the description. There's also a timer and a score displayed for them.
0:03:24 [video of a human-human game plays]
0:03:38 Okay, so as you can see, in this particular game the dialogue is very fast and incremental, with a lot of rapid turn-taking. It's a fast-paced game, and it's fun.
0:03:52 We collected a lot of data from these human-human conversations and then designed an incremental agent, whom we call Eve. She is a high-performance baseline system, trained on the human conversation data; I'll provide more details in the coming slides. We evaluated her with one hundred and twenty-five users, and she performs nearly as well as the humans.
0:04:19 This video shows how the interaction between Eve and a human goes; this is Eve playing the game. On the top you see the eight images that the human sees, and on the bottom you see Eve's eight images with green bars going up and down: these are basically her confidence in each image, and they change based on the human's descriptions.
0:04:45 [video plays: the human describes images such as 'a yellow bird', 'a sleeping black and white cat', and 'a bike with handlebars', and Eve responds with 'got it' or asks which one]
0:05:20 Alright, so that's Eve playing the game with humans in real time, and she's not bad to begin with.
0:05:28 How does she work? Basically, we have the user's speech coming in, and an incremental ASR (Kaldi) provides a one-best hypothesis every hundred milliseconds. We use this hypothesis to compute a confidence distribution over all eight images on the screen. The dialogue policy then uses this distribution to decide whether to wait, to select, or to skip. The wait action means she stays silent and keeps listening. Select means she has enough confidence to make the selection. Skip is where she's thinking, 'Hey, I'm not getting much information here; maybe I should just skip, go to the next one, and hope I get that one right.' The natural language generation is very simple and template-based: if she selects, she says 'Got it', as you heard in the video, and if she skips she says so and moves on.
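As a rough illustration of this pipeline, here is a minimal Python sketch. It is not the actual system: the NLU below is a toy keyword-overlap stand-in for the trained model, and all names and values are placeholders.

```python
# Sketch of Eve's incremental loop: a 1-best ASR partial arrives roughly every
# 100 ms, the NLU turns it into a confidence distribution over the eight images,
# and the dialogue policy picks wait / select / skip.
from enum import Enum

class Action(Enum):
    WAIT = "wait"
    SELECT = "select"
    SKIP = "skip"

def nlu_confidences(partial_hypothesis, images):
    """Toy stand-in for the trained NLU: score each image by keyword overlap."""
    words = set(partial_hypothesis.lower().split())
    scores = {name: len(words & set(kws)) + 1e-6 for name, kws in images.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

def run_subdialogue(asr_partials, images, policy):
    elapsed = 0.0
    for partial in asr_partials:                 # one 1-best hypothesis per ~100 ms
        elapsed += 0.1
        conf = nlu_confidences(partial, images)
        action = policy(max(conf.values()), elapsed)
        if action == Action.SELECT:
            print("Got it")                      # template-based NLG
            return max(conf, key=conf.get)
        if action == Action.SKIP:
            print("Let's move on")               # placeholder skip utterance
            return None
        # Action.WAIT: stay silent and keep listening
    return None
```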
0:06:25 Now, the focus of this work is the dialogue policy. In the previous work the dialogue policy uses hand-designed rules; I'll explain why we call them carefully designed rules in a minute. We asked whether we could do better than the current baseline, so we used reinforcement learning and tried to see whether it can perform better.
0:06:52 The carefully designed baseline uses these quantities. The first is P*_t, the highest probability assigned by the NLU to any one of the eight images. Then there are two values, the identification threshold (IT) and the give-up threshold (GT). The identification threshold is the minimum confidence that has to be reached for any given image before Eve will say 'Got it', and the give-up threshold is the maximum time she waits, after which she says skip; at any time in between she keeps waiting. That is the carefully-designed-rules (CDR) baseline system.
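In code, the described rule amounts to something like the following sketch (reusing the Action enum from the earlier sketch; the threshold values here are made-up placeholders, not the tuned ones):

```python
# CDR-style rule: select once the top NLU confidence passes the identification
# threshold, skip once the elapsed time passes the give-up threshold, else wait.
IT = 0.8    # identification threshold (placeholder value)
GT = 10.0   # give-up threshold in seconds (placeholder value)

def cdr_policy(p_star_t, time_consumed):
    if p_star_t >= IT:
        return Action.SELECT      # confident enough: "Got it"
    if time_consumed >= GT:
        return Action.SKIP        # out of patience: move on
    return Action.WAIT
```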
0:07:33 Why do we call them carefully designed rules? In published comparisons of learned dialogue policies and rule-based ones, one thing that is often unclear is how much time was actually spent designing the rule-based systems; in this work we report that. The identification threshold IT and the give-up threshold GT are not some arbitrary values that we picked: they are tuned from the human conversation data. We use something called the eavesdropper framework to obtain IT and GT; for more details please refer to our 2015 paper.
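A rough sketch of this kind of offline threshold tuning, under the assumption that each logged sub-dialogue gives one (confidence, elapsed time, best image) step per ASR partial plus the ground-truth target; the function and field names are hypothetical, not the actual eavesdropper implementation.

```python
# Hypothetical offline tuning of IT and GT: replay every logged sub-dialogue
# under candidate thresholds and keep the pair that maximizes simulated points.
import itertools

def simulate(subdialogue, it, gt):
    """Points the CDR rule would have earned on one logged sub-dialogue."""
    for p_star_t, elapsed, best_image in subdialogue.steps:   # one step per partial
        if p_star_t >= it:
            return 1 if best_image == subdialogue.target else 0
        if elapsed >= gt:
            return 0      # gave up and skipped
    return 0              # ran out of partials without acting

def tune_thresholds(corpus, it_grid, gt_grid):
    return max(itertools.product(it_grid, gt_grid),
               key=lambda pair: sum(simulate(d, *pair) for d in corpus))
```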
0:08:16 We spent almost one month trying to find the best way to design these policies; predicting the next word is one example of what we tried. The resulting rules actually perform nearly as well as humans, so it's a really strong baseline.
0:08:35 But even though these are carefully designed rules, Eve still has a few limitations, which I've grouped into case one, case two, and case three. In this slide the x-axis is time as the game goes along and the y-axis is the confidence assigned by the NLU; each of the points is a partial coming in from the ASR, so the confidence keeps changing. In case one, Eve is very eager to skip. In case two, she's very eager to select: sometimes what happens with partial, incremental speech recognition is that we get a lot of unstable hypotheses, and that often leads to this kind of premature commitment. In case three, Eve could actually save time by selecting much earlier. So these are three cases where Eve could perform better.
0:09:39 So we use reinforcement learning. The state space is represented by a tuple: P*_t, the highest confidence assigned to any one of the eight images, and the time consumed so far, which is what she has observed in the interaction. The actions are select, skip, and wait. The transitions are not hand-crafted; they follow what happens in the data. The reward is very simple: if Eve gets the image right she gets a reward of plus one hundred, and if she gets it wrong it's negative one hundred; the wait reward is a very small epsilon value, very close to zero, and she gets a bit more reward for skipping.
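Written out, the reward and state described here look roughly like this; the exact epsilon and skip values are placeholders, since only the plus/minus one hundred for a correct/incorrect selection are stated explicitly.

```python
# Compact sketch of the MDP pieces from the talk (placeholder magnitudes for
# the wait and skip rewards).
WAIT_REWARD = 1e-3    # "very small epsilon, close to zero"
SKIP_REWARD = 0.5     # "a bit more reward for skipping"

def reward(action, selected_correctly=None):
    if action == Action.SELECT:
        return 100.0 if selected_correctly else -100.0
    if action == Action.SKIP:
        return SKIP_REWARD
    return WAIT_REWARD                     # Action.WAIT

def make_state(p_star_t, time_consumed):
    return (p_star_t, time_consumed)       # the state observed every 100 ms
```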
0:10:23 The data that we use for this experiment comes in three flavours: the human-human data we collected in the lab, the human-human web interaction data collected in another experiment, and Eve's interactions with humans, the one hundred and twenty-five users I was talking about. Altogether there are more than thirteen thousand sub-dialogues. We split them based on the users: ninety percent of the users are used for training and ten percent for testing. For reinforcement learning we use LSPI, that is, least-squares policy iteration, and we use radial basis functions for representing the features.
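A minimal sketch of a radial-basis-function featurization of the two-dimensional state, of the kind LSPI would consume; the grid of centers and the widths below are illustrative choices, not the ones used in the work.

```python
# RBF features over (P*_t, time consumed): one Gaussian bump per grid center,
# plus a bias term.
import numpy as np

CENTERS = [(p, t) for p in np.linspace(0.0, 1.0, 5) for t in np.linspace(0.0, 15.0, 5)]
WIDTH = np.array([0.25, 3.0])   # per-dimension width (placeholder)

def rbf_features(p_star_t, time_consumed):
    s = np.array([p_star_t, time_consumed])
    feats = [np.exp(-np.sum(((s - np.array(c)) / WIDTH) ** 2)) for c in CENTERS]
    return np.array([1.0] + feats)
```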
0:11:04 So how does it operate? Every hundred milliseconds the ASR gives out a partial, P*_t is assigned by the NLU, and the policy decides whether to wait, select, or skip. If it's wait, we simply sample the next time step, that is, what happened at two hundred milliseconds, and based on the new value of P*_t the policy makes a new decision. This keeps happening until we see a selection or a skip, and at that point we know the ground truth, so we can assign the reward based on that.
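Under the same assumptions as the earlier sketches (and reusing their helpers), turning one logged sub-dialogue into training transitions could look like this; the data-structure fields are hypothetical.

```python
# Build (state, action, reward, next_state) samples from a logged sub-dialogue:
# advance one 100 ms partial at a time while the policy waits, and terminate
# with the game reward once it selects or skips.
def transitions_from_subdialogue(subdialogue, policy):
    samples = []
    steps = subdialogue.steps                      # (P*_t, elapsed, best_image) per partial
    for i, (p_star_t, elapsed, best_image) in enumerate(steps):
        s = make_state(p_star_t, elapsed)
        a = policy(*s)
        if a == Action.WAIT and i + 1 < len(steps):
            nxt = steps[i + 1]
            samples.append((s, a, reward(a), make_state(nxt[0], nxt[1])))
            continue
        correct = best_image == subdialogue.target  # ground truth from the corpus
        samples.append((s, a, reward(a, correct), None))
        break
    return samples
```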
0:11:44 This is a snapshot of how things play out. On the x-axis you see the partials: each of those markers is a partial coming in from the ASR. On the y-axis you see the confidence assigned by the NLU. In this example you see the baseline agent skipping at this point, whereas the RL agent is actually waiting for a longer time until she sees a very high confidence, and hence she gets the image right.
0:12:14 Okay, I want to take a little time to explain this graph; it's not so straightforward. On the horizontal axis you see three groups: on the left the wait actions, in the middle the skip actions, and on the right the select actions. This graph shows the complete state space, everything in the state. The red dots indicate the baseline agent's decisions, and the blue dots are what was learned by the reinforcement learning policy. On the vertical axis you see the time going from zero to fifteen, and on the other axis you see the confidence going from zero to one. You can see that the red dots fit tightly together: it's a rule-based system, so we can deterministically know what action the agent takes in any state. The blue dots are the actions learned by reinforcement learning. She is learning similar things, but there are some differences: the reinforcement learning policy learns to select an image only at very high confidence, extremely high confidence, that is 1.0, when the time consumed is low, and when the time consumed is not so low she learns to wait more. By waiting more she gets more partials, that is, more words, as a result of which she has a better chance of performing well in the game and hence scoring more points.
0:13:45 This graph shows something simpler. On the x-axis you see the average points scored for one of the image subsets, and on the y-axis you see the time taken. The blue points are the reinforcement learning agent and the red points are the baseline agent. You can see the RL agent is actually waiting for a longer time and scoring more points, while the baseline system is in more of a hurry to skip or make a selection. So here the RL agent is scoring significantly more points than the baseline, and there's a trend whereby, as she performs better, she's taking more time to make the selections.
0:14:31 So why couldn't the CDR baseline learn what the reinforcement learning agent learned? If you go back to the policy we used for the CDR baseline, it treats the time and the confidence value P*_t independently of each other, whereas reinforcement learning is learning to optimize the policy based on P*_t and the time consumed jointly, and that results in the reinforcement learning agent performing much better than the baseline agent.
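To make the contrast concrete: the learned policy acts on both state features at once, e.g. by taking the argmax over a linear Q-function of the RBF features (a sketch reusing the earlier helpers; the weights stand in for whatever LSPI actually learned).

```python
# Greedy action from a linear Q-function: one weight vector per action over the
# RBF features, so the decision boundary depends jointly on P*_t and time.
def greedy_action(p_star_t, time_consumed, weights):
    phi = rbf_features(p_star_t, time_consumed)
    return max(Action, key=lambda a: float(weights[a] @ phi))
```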
0:15:12 This table shows how many points each agent scores, and PPS is the points per second, which combines both the points and the time aspect in one number. You can see the RL agent consistently scores much higher in terms of points across all the image sets, but the points per second is what is of particular interest, and it varies. In one of the subsets the baseline's points per second is 0.09 and the RL agent's is 0.14, which means that by scoring more points she is actually doing better in the game, because her points per second is also a lot higher. In the necklaces subset we see that even though the baseline agent scored far fewer points, its points per second is very high; that's because the baseline agent is very eager and won some of those points by chance, whereas the RL agent gets more points basically by waiting more, as a result of which her PPS is lower.
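For reference, the points-per-second metric is just points divided by elapsed time; the point and time totals in the comment are made-up numbers chosen only to illustrate the 0.09 versus 0.14 contrast.

```python
# Points per second (PPS): total points earned divided by total time spent,
# so it rewards both accuracy and speed.
def points_per_second(total_points, total_seconds):
    return total_points / total_seconds

# e.g. 3 points in 33 s gives PPS ~0.09, while 7 points in 50 s gives PPS 0.14
# (illustrative numbers, not the actual game totals).
```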
0:16:20 I want to discuss a little bit about the effort and the time involved. Rule-based systems are often criticised as being laborious and time-consuming to build, and they are, but this one actually performs nearly as well as humans, so I'm not sure that criticism is entirely fair. It also took nearly the same amount of time to build the CDR baseline as the reinforcement learning policy, excluding, of course, the data collection and infrastructure-building efforts. But the advantage we get is that the RL approach is more scalable, because adding features is easier.
0:17:03 For future work, we want to investigate whether these improvements transfer to live interactions, which means putting the reinforcement learning policy into the agent and seeing whether she actually performs better in a real user study. We also want to explore adding more features to the state space, and to learn the reward function from the data using inverse reinforcement learning. Finally, I want to thank the anonymous reviewers for their very useful comments, the NSF and our other sponsors for supporting this work, and the people who provided the images used in this paper. Thank you very much; I'm happy to take questions.
0:17:57 [Session chair] Thank you very much. We now have time for questions.
0:18:06 Q: Thank you very much for a nice talk. Just a clarification question regarding your reinforcement learning setup: if I'm correct, you're learning from a corpus, right? (A: Yep.) But you're using least-squares policy iteration, which is an on-policy method that requires learning from interaction rather than learning from a corpus.
0:18:31 A: Right, so the way to explain it is that we treat the corpus as if it were a real interaction. For every hundred milliseconds, as would happen in a real interaction with a user, we sample each time step of a sub-dialogue: for the first hundred milliseconds we have a partial, and for that partial we have the probability distribution at that time and the time consumed, and we use the probability distribution and the time as features. Then, just as would happen in a real interaction, the next thing that happens is the next partial coming in, which is something the user actually spoke in the data that we collected, and it keeps going on like that. So basically it's trained per sub-dialogue, per image.
0:19:32 Q: But I still think you would get an improvement if you used something like importance sampling, to account for the fact that you're seeing a trajectory that happened in the corpus rather than in the online exploration setting that on-policy reinforcement learning assumes.
0:19:55 A: That's a good point; that's something worth exploring.
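The questioner's suggestion amounts to re-weighting logged trajectories by an importance ratio; a minimal sketch follows. This is not part of the presented work, and the `target_prob` and `behaviour_prob` functions are hypothetical stand-ins for the two policies' action probabilities.

```python
# Trajectory-level importance weight: the product over logged steps of the
# target policy's probability of the logged action divided by the behaviour
# (logging) policy's probability of that action.
def importance_weight(trajectory, target_prob, behaviour_prob):
    w = 1.0
    for state, action in trajectory:
        w *= target_prob(state, action) / behaviour_prob(state, action)
    return w
```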
0:20:20 Q: Thanks for the talk, two questions. First, can you explain a little more how you handle image recognition; are you using some CNN model?
A: We fake the vision. The way the NLU is trained is that we have the human data we collected, where humans are actually describing the images. We have the descriptions from games where two humans were speaking and describing the target image, so we have the words associated with each image. Using the real images and learning from the image itself rather than faking the vision is something we really want to do, but in this particular work the NLU is just learning from the text.
Q: And did you play around with the reward settings, for example making the wait reward negative so you might speed up the agent?
A: We tried a lot of different things. We didn't start with LSPI; in the beginning we tried different algorithms. For example, we tried Q-learning, but it needed a lot more samples. And we did try a negative reward for the wait actions, but that would mean the agent is actually penalized for waiting, and we don't really want that: we want the agent to be rewarded for doing well in the game rather than shaped by specific reward-function manipulation. The reward function is meant to reflect what's happening in the game: more points for getting images right.
0:22:03more points for
0:22:08and flexible that well i just one the let us try switching the roles of
0:22:13human the most we need the game like what would happen i
0:22:17the machine have has to describe the actions so we v so currently
0:22:25the agent is only in the matcher role
0:22:28it's not playing the role of the director it becomes much more complex because we
0:22:32have to incrementally generate the descriptions but that's something that we really want to know
0:22:36like in
0:22:37in the future work
0:22:39we don't know how
0:22:44 Q: Thanks, very nice talk. Just a quick question about the state representation: are you putting the partials themselves into the state?
A: No, the state just has the confidence and the time.
Q: Okay, so you're not modeling the instability of the partials: a partial might say 'bicycle' at one point and then change to something else. If you put the stability of the partials into the state, the policy might be able to learn more.
0:23:22 A: That is right. I wanted to show one small thing: the instability in case two, where Eve scores less precisely because of this instability. What actually happens in the game is that the NLU confidence fluctuates, with all these blips, but because the policy learns over the probability and the time jointly, it ends up waiting a lot more, which gives those blips a chance to settle. But that's a fair question.
Q: I think if you had more information in the state, the learning would probably be more successful, because right now you may be violating the MDP assumptions a little bit.
A: Yes, adding more features to the state is something we want to do.
0:24:20 Right, thank you. [Session chair] Thank you; let's thank the speaker once again.