0:00:15 Right, for the last talk of the session I'll try to keep it fun; there are a lot of videos, so hopefully it stays engaging. This work is on using reinforcement learning for modeling incrementality in the context of a fast-paced dialogue game.
0:00:38 This is joint work with my advisers, David DeVault and Kallirroi Georgila.
0:00:44 So, incrementality is what this work is focused on. Human speech processing is incremental: we process content word by word, and sometimes even sub-word, trying to process it as soon as it becomes available. Incrementality helps us model different natural dialogue phenomena such as rapid turn-taking, speech overlaps, barge-ins, and backchannels, so modeling these things is very important for making dialogue systems more natural and efficient.
0:01:18 The contributions of this work can be grouped into three points. The first is that we provide a reinforcement learning method to model incrementality. The second is that we provide a detailed analysis of what the learned policy does. In our previous work we built a state-of-the-art, carefully designed rule-based baseline system which interacts with humans in real time and performs nearly as well as humans, so it's a really strong baseline; you'll see videos and get more context in the slides to come. The reinforcement learning model introduced here actually outperforms this carefully-designed-rules (CDR) baseline, but please keep in mind the evaluation is offline: we don't have it in a real-time system yet. Third, we also provide some analysis of the development time it took to develop each approach.
0:02:13 The selected domain is a rapid dialogue game we call RDG-Image. It's a two-player, collaborative image-matching game. Each person is assigned a role, either the director or the matcher. The director sees the eight images you see on the screen, one of which is highlighted with a red border, and is supposed to describe that one. The matcher sees the same eight images in a different order and is supposed to make a selection based on the description given. The goal is to get as many matches as possible in a limited amount of time, so the dialogue is fast and incremental.
0:02:55 Let's look at an example of how this game works. Here you see two human players playing with one another. The person on the top is the director: they see one of the images highlighted with a red border and describe that highlighted image. The person below is the matcher, who tries to guess the image based on the description. There's also a timer and a score displayed for them.
0:03:24 [video of a human-human game plays]
0:03:38 Okay, so as you can see, in this particular game the dialogue is very fast and incremental, with a lot of rapid turn-taking. It's a fast-paced game, and it's fun.
0:03:52 We collected a lot of data from these human-human conversations and then designed an incremental agent, whom we call Eve. She is a high-performance baseline system, trained on the human conversation data; I'll provide more details in the coming slides. We evaluated her with one hundred and twenty-five users, and she performs nearly as well as the humans.
0:04:19 This video shows how the interaction between Eve and a human goes; this is Eve playing the game. On the top you see the eight images that the human sees, and on the bottom you see Eve's eight images with green bars going up and down: these are basically her confidence in each image, and they change based on the human's descriptions.
0:04:45 [video plays: the human describes images such as 'a yellow bird', 'a sleeping black and white cat', and 'a bike with handlebars', and Eve responds with 'got it' or asks which one]
0:05:20 Alright, so that's Eve playing the game with humans in real time, and she's not bad to begin with.
0:05:28 How does she work? Basically, we have the user's speech coming in, and an incremental ASR (Kaldi) provides a one-best hypothesis every hundred milliseconds. We use this hypothesis to compute a confidence distribution over all eight images on the screen. The dialogue policy then uses this distribution to decide whether to wait, to select, or to skip. The wait action means she stays silent and keeps listening. Select means she has enough confidence to make the selection. Skip is where she's thinking, 'Hey, I'm not getting much information here; maybe I should just skip, go to the next one, and hope I get that one right.' The natural language generation is very simple and template-based: if she selects, she says 'Got it', as you heard in the video, and if she skips she says so and moves on.
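As a rough illustration of this pipeline, here is a minimal Python sketch. It is not the actual system: the NLU below is a toy keyword-overlap stand-in for the trained model, and all names and values are placeholders.

```python
# Sketch of Eve's incremental loop: a 1-best ASR partial arrives roughly every
# 100 ms, the NLU turns it into a confidence distribution over the eight images,
# and the dialogue policy picks wait / select / skip.
from enum import Enum

class Action(Enum):
    WAIT = "wait"
    SELECT = "select"
    SKIP = "skip"

def nlu_confidences(partial_hypothesis, images):
    """Toy stand-in for the trained NLU: score each image by keyword overlap."""
    words = set(partial_hypothesis.lower().split())
    scores = {name: len(words & set(kws)) + 1e-6 for name, kws in images.items()}
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

def run_subdialogue(asr_partials, images, policy):
    elapsed = 0.0
    for partial in asr_partials:                 # one 1-best hypothesis per ~100 ms
        elapsed += 0.1
        conf = nlu_confidences(partial, images)
        action = policy(max(conf.values()), elapsed)
        if action == Action.SELECT:
            print("Got it")                      # template-based NLG
            return max(conf, key=conf.get)
        if action == Action.SKIP:
            print("Let's move on")               # placeholder skip utterance
            return None
        # Action.WAIT: stay silent and keep listening
    return None
```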
0:06:25 Now, the focus of this work is the dialogue policy. In the previous work the dialogue policy uses hand-designed rules; I'll explain why we call them carefully designed rules in a minute. We asked whether we could do better than the current baseline, so we used reinforcement learning and tried to see whether it can perform better.
0:06:52 The carefully designed baseline uses these quantities. The first is P*_t, the highest probability assigned by the NLU to any one of the eight images. Then there are two values, the identification threshold (IT) and the give-up threshold (GT). The identification threshold is the minimum confidence that has to be reached for any given image before Eve will say 'Got it', and the give-up threshold is the maximum time she waits, after which she says skip; at any time in between she keeps waiting. That is the carefully-designed-rules (CDR) baseline system.
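In code, the described rule amounts to something like the following sketch (reusing the Action enum from the earlier sketch; the threshold values here are made-up placeholders, not the tuned ones):

```python
# CDR-style rule: select once the top NLU confidence passes the identification
# threshold, skip once the elapsed time passes the give-up threshold, else wait.
IT = 0.8    # identification threshold (placeholder value)
GT = 10.0   # give-up threshold in seconds (placeholder value)

def cdr_policy(p_star_t, time_consumed):
    if p_star_t >= IT:
        return Action.SELECT      # confident enough: "Got it"
    if time_consumed >= GT:
        return Action.SKIP        # out of patience: move on
    return Action.WAIT
```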
0:07:33 Why do we call them carefully designed rules? In published comparisons of learned dialogue policies and rule-based ones, one thing that is often unclear is how much time was actually spent designing the rule-based systems; in this work we report that. The identification threshold IT and the give-up threshold GT are not some arbitrary values that we picked: they are tuned from the human conversation data. We use something called the eavesdropper framework to obtain IT and GT; for more details please refer to our 2015 paper.
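A rough sketch of this kind of offline threshold tuning, under the assumption that each logged sub-dialogue gives one (confidence, elapsed time, best image) step per ASR partial plus the ground-truth target; the function and field names are hypothetical, not the actual eavesdropper implementation.

```python
# Hypothetical offline tuning of IT and GT: replay every logged sub-dialogue
# under candidate thresholds and keep the pair that maximizes simulated points.
import itertools

def simulate(subdialogue, it, gt):
    """Points the CDR rule would have earned on one logged sub-dialogue."""
    for p_star_t, elapsed, best_image in subdialogue.steps:   # one step per partial
        if p_star_t >= it:
            return 1 if best_image == subdialogue.target else 0
        if elapsed >= gt:
            return 0      # gave up and skipped
    return 0              # ran out of partials without acting

def tune_thresholds(corpus, it_grid, gt_grid):
    return max(itertools.product(it_grid, gt_grid),
               key=lambda pair: sum(simulate(d, *pair) for d in corpus))
```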
0:08:16 We spent almost one month trying to find the best way to design these policies; predicting the next word is one example of what we tried. The resulting rules actually perform nearly as well as humans, so it's a really strong baseline.
0:08:35 But even though these are carefully designed rules, Eve still has a few limitations, which I've grouped into case one, case two, and case three. In this slide the x-axis is time as the game goes along and the y-axis is the confidence assigned by the NLU; each of the points is a partial coming in from the ASR, so the confidence keeps changing. In case one, Eve is very eager to skip. In case two, she's very eager to select: sometimes what happens with partial, incremental speech recognition is that we get a lot of unstable hypotheses, and that often leads to this kind of premature commitment. In case three, Eve could actually save time by selecting much earlier. So these are three cases where Eve could perform better.
0:09:39 So we use reinforcement learning. The state space is represented by a tuple: P*_t, the highest confidence assigned to any one of the eight images, and the time consumed so far, which is what she has observed in the interaction. The actions are select, skip, and wait. The transitions are not hand-crafted; they follow what happens in the data. The reward is very simple: if Eve gets the image right she gets a reward of plus one hundred, and if she gets it wrong it's negative one hundred; the wait reward is a very small epsilon value, very close to zero, and she gets a bit more reward for skipping.
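Written out, the reward and state described here look roughly like this; the exact epsilon and skip values are placeholders, since only the plus/minus one hundred for a correct/incorrect selection are stated explicitly.

```python
# Compact sketch of the MDP pieces from the talk (placeholder magnitudes for
# the wait and skip rewards).
WAIT_REWARD = 1e-3    # "very small epsilon, close to zero"
SKIP_REWARD = 0.5     # "a bit more reward for skipping"

def reward(action, selected_correctly=None):
    if action == Action.SELECT:
        return 100.0 if selected_correctly else -100.0
    if action == Action.SKIP:
        return SKIP_REWARD
    return WAIT_REWARD                     # Action.WAIT

def make_state(p_star_t, time_consumed):
    return (p_star_t, time_consumed)       # the state observed every 100 ms
```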
0:10:23 The data that we use for this experiment comes in three flavours: the human-human data we collected in the lab, the human-human web interaction data collected in another experiment, and Eve's interactions with humans, the one hundred and twenty-five users I was talking about. Altogether there are more than thirteen thousand sub-dialogues. We split them based on the users: ninety percent of the users are used for training and ten percent for testing. For reinforcement learning we use LSPI, that is, least-squares policy iteration, and we use radial basis functions for representing the features.
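A minimal sketch of a radial-basis-function featurization of the two-dimensional state, of the kind LSPI would consume; the grid of centers and the widths below are illustrative choices, not the ones used in the work.

```python
# RBF features over (P*_t, time consumed): one Gaussian bump per grid center,
# plus a bias term.
import numpy as np

CENTERS = [(p, t) for p in np.linspace(0.0, 1.0, 5) for t in np.linspace(0.0, 15.0, 5)]
WIDTH = np.array([0.25, 3.0])   # per-dimension width (placeholder)

def rbf_features(p_star_t, time_consumed):
    s = np.array([p_star_t, time_consumed])
    feats = [np.exp(-np.sum(((s - np.array(c)) / WIDTH) ** 2)) for c in CENTERS]
    return np.array([1.0] + feats)
```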
0:11:04 So how does it operate? Every hundred milliseconds the ASR gives out a partial, P*_t is assigned by the NLU, and the policy decides whether to wait, select, or skip. If it's wait, we simply sample the next time step, that is, what happened at two hundred milliseconds, and based on the new value of P*_t the policy makes a new decision. This keeps happening until we see a selection or a skip, and at that point we know the ground truth, so we can assign the reward based on that.
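Under the same assumptions as the earlier sketches (and reusing their helpers), turning one logged sub-dialogue into training transitions could look like this; the data-structure fields are hypothetical.

```python
# Build (state, action, reward, next_state) samples from a logged sub-dialogue:
# advance one 100 ms partial at a time while the policy waits, and terminate
# with the game reward once it selects or skips.
def transitions_from_subdialogue(subdialogue, policy):
    samples = []
    steps = subdialogue.steps                      # (P*_t, elapsed, best_image) per partial
    for i, (p_star_t, elapsed, best_image) in enumerate(steps):
        s = make_state(p_star_t, elapsed)
        a = policy(*s)
        if a == Action.WAIT and i + 1 < len(steps):
            nxt = steps[i + 1]
            samples.append((s, a, reward(a), make_state(nxt[0], nxt[1])))
            continue
        correct = best_image == subdialogue.target  # ground truth from the corpus
        samples.append((s, a, reward(a, correct), None))
        break
    return samples
```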
0:11:44 This is a snapshot of how things play out. On the x-axis you see the partials: each of those markers is a partial coming in from the ASR. On the y-axis you see the confidence assigned by the NLU. In this example you see the baseline agent skipping at this point, whereas the RL agent is actually waiting for a longer time until she sees a very high confidence, and hence she gets the image right.
0:12:14 Okay, I want to take a little time to explain this graph; it's not so straightforward. On the horizontal axis you see three groups: on the left the wait actions, in the middle the skip actions, and on the right the select actions. This graph shows the complete state space, everything in the state. The red dots indicate the baseline agent's decisions, and the blue dots are what was learned by the reinforcement learning policy. On the vertical axis you see the time going from zero to fifteen, and on the other axis you see the confidence going from zero to one. You can see that the red dots fit tightly together: it's a rule-based system, so we can deterministically know what action the agent takes in any state. The blue dots are the actions learned by reinforcement learning. She is learning similar things, but there are some differences: the reinforcement learning policy learns to select an image only at very high confidence, extremely high confidence, that is 1.0, when the time consumed is low, and when the time consumed is not so low she learns to wait more. By waiting more she gets more partials, that is, more words, as a result of which she has a better chance of performing well in the game and hence scoring more points.
0:13:45 This graph shows something simpler. On the x-axis you see the average points scored for one of the image subsets, and on the y-axis you see the time taken. The blue points are the reinforcement learning agent and the red points are the baseline agent. You can see the RL agent is actually waiting for a longer time and scoring more points, while the baseline system is in more of a hurry to skip or make a selection. So here the RL agent is scoring significantly more points than the baseline, and there's a trend whereby, as she performs better, she's taking more time to make the selections.
0:14:31 So why couldn't the CDR baseline learn what the reinforcement learning agent learned? If you go back to the policy we used for the CDR baseline, it treats the time and the confidence value P*_t independently of each other, whereas reinforcement learning is learning to optimize the policy based on P*_t and the time consumed jointly, and that results in the reinforcement learning agent performing much better than the baseline agent.
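To make the contrast concrete: the learned policy acts on both state features at once, e.g. by taking the argmax over a linear Q-function of the RBF features (a sketch reusing the earlier helpers; the weights stand in for whatever LSPI actually learned).

```python
# Greedy action from a linear Q-function: one weight vector per action over the
# RBF features, so the decision boundary depends jointly on P*_t and time.
def greedy_action(p_star_t, time_consumed, weights):
    phi = rbf_features(p_star_t, time_consumed)
    return max(Action, key=lambda a: float(weights[a] @ phi))
```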
0:15:12 This table shows how many points each agent scores, and PPS is the points per second, which combines both the points and the time aspect in one number. You can see the RL agent consistently scores much higher in terms of points across all the image sets, but the points per second is what is of particular interest, and it varies. In one of the subsets the baseline's points per second is 0.09 and the RL agent's is 0.14, which means that by scoring more points she is actually doing better in the game, because her points per second is also a lot higher. In the necklaces subset we see that even though the baseline agent scored far fewer points, its points per second is very high; that's because the baseline agent is very eager and won some of those points by chance, whereas the RL agent gets more points basically by waiting more, as a result of which her PPS is lower.
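For reference, the points-per-second metric is just points divided by elapsed time; the point and time totals in the comment are made-up numbers chosen only to illustrate the 0.09 versus 0.14 contrast.

```python
# Points per second (PPS): total points earned divided by total time spent,
# so it rewards both accuracy and speed.
def points_per_second(total_points, total_seconds):
    return total_points / total_seconds

# e.g. 3 points in 33 s gives PPS ~0.09, while 7 points in 50 s gives PPS 0.14
# (illustrative numbers, not the actual game totals).
```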
0:16:20 I want to discuss a little bit about the effort and the time involved. Rule-based systems are often criticised as being laborious and time-consuming to build, and they are, but this one actually performs nearly as well as humans, so I'm not sure that criticism is entirely fair. It also took nearly the same amount of time to build the CDR baseline as the reinforcement learning policy, excluding, of course, the data collection and infrastructure-building efforts. But the advantage we get is that the RL approach is more scalable, because adding features is easier.
0:17:03 For future work, we want to investigate whether these improvements transfer to live interactions, which means putting the reinforcement learning policy into the agent and seeing whether she actually performs better in a real user study. We also want to explore adding more features to the state space, and to learn the reward function from the data using inverse reinforcement learning. Finally, I want to thank the anonymous reviewers for their very useful comments, the NSF and our other sponsors for supporting this work, and the people who provided the images used in this paper. Thank you very much; I'm happy to take questions.
0:17:57 [Session chair] Thank you very much. We now have time for questions.
0:18:06 Q: Thank you very much for a nice talk. Just a clarification question regarding your reinforcement learning setup: if I'm correct, you're learning from a corpus, right? (A: Yep.) But you're using least-squares policy iteration, which is an on-policy method that requires learning from interaction rather than learning from a corpus.
0:18:31 A: Right, so the way to explain it is that we treat the corpus as if it were a real interaction. For every hundred milliseconds, as would happen in a real interaction with a user, we sample each time step of a sub-dialogue: for the first hundred milliseconds we have a partial, and for that partial we have the probability distribution at that time and the time consumed, and we use the probability distribution and the time as features. Then, just as would happen in a real interaction, the next thing that happens is the next partial coming in, which is something the user actually spoke in the data that we collected, and it keeps going on like that. So basically it's trained per sub-dialogue, per image.
0:19:32 Q: But I still think you would get an improvement if you used something like importance sampling, to account for the fact that you're seeing a trajectory that happened in the corpus rather than in the online exploration setting that on-policy reinforcement learning assumes.
0:19:55 A: That's a good point; that's something worth exploring.
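The questioner's suggestion amounts to re-weighting logged trajectories by an importance ratio; a minimal sketch follows. This is not part of the presented work, and the `target_prob` and `behaviour_prob` functions are hypothetical stand-ins for the two policies' action probabilities.

```python
# Trajectory-level importance weight: the product over logged steps of the
# target policy's probability of the logged action divided by the behaviour
# (logging) policy's probability of that action.
def importance_weight(trajectory, target_prob, behaviour_prob):
    w = 1.0
    for state, action in trajectory:
        w *= target_prob(state, action) / behaviour_prob(state, action)
    return w
```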
0:20:20 Q: Thanks for the talk, two questions. First, can you explain a little more how you handle image recognition; are you using some CNN model?
A: We fake the vision. The way the NLU is trained is that we have the human data we collected, where humans are actually describing the images. We have the descriptions from games where two humans were speaking and describing the target image, so we have the words associated with each image. Using the real images and learning from the image itself rather than faking the vision is something we really want to do, but in this particular work the NLU is just learning from the text.
Q: And did you play around with the reward settings, for example making the wait reward negative so you might speed up the agent?
A: We tried a lot of different things. We didn't start with LSPI; in the beginning we tried different algorithms. For example, we tried Q-learning, but it needed a lot more samples. And we did try a negative reward for the wait actions, but that would mean the agent is actually penalized for waiting, and we don't really want that: we want the agent to be rewarded for doing well in the game rather than shaped by specific reward-function manipulation. The reward function is meant to reflect what's happening in the game: more points for getting images right.
0:22:03more points for
0:22:08and flexible that well i just one the let us try switching the roles of
0:22:13human the most we need the game like what would happen i
0:22:17the machine have has to describe the actions so we v so currently
0:22:25the agent is only in the matcher role
0:22:28it's not playing the role of the director it becomes much more complex because we
0:22:32have to incrementally generate the descriptions but that's something that we really want to know
0:22:36like in
0:22:37in the future work
0:22:39we don't know how
0:22:44 Q: Thanks, very nice talk. Just a quick question about the state representation: are you putting the partials themselves into the state?
A: No, the state just has the confidence and the time.
Q: Okay, so you're not modeling the instability of the partials: a partial might say 'bicycle' at one point and then change to something else. If you put the stability of the partials into the state, the policy might be able to learn more.
0:23:22 A: That is right. I wanted to show one small thing: the instability in case two, where Eve scores less precisely because of this instability. What actually happens in the game is that the NLU confidence fluctuates, with all these blips, but because the policy learns over the probability and the time jointly, it ends up waiting a lot more, which gives those blips a chance to settle. But that's a fair question.
Q: I think if you had more information in the state, the learning would probably be more successful, because right now you may be violating the MDP assumptions a little bit.
A: Yes, adding more features to the state is something we want to do.
0:24:20 Right, thank you. [Session chair] Thank you; let's thank the speaker once again.