| 0:00:15 | right |
|---|
| 0:00:18 | for the last talk of the session |
|---|
| 0:00:20 | i'll try to keep it fun, there are a lot of videos, so hopefully it's |
|---|
| 0:00:23 | going to be |
|---|
| 0:00:25 | fun and engaging |
|---|
| 0:00:27 | so this work is |
|---|
| 0:00:29 | on using reinforcement learning for modeling incrementality in the context of a fast paced dialogue |
|---|
| 0:00:37 | game |
|---|
| 0:00:38 | this is joint work with my advisers, David DeVault and Kallirroi Georgila |
|---|
| 0:00:43 | so |
|---|
| 0:00:44 | incrementality that's what this work is focused on |
|---|
| 0:00:48 | human speech is incremental, right |
|---|
| 0:00:51 | so we process the content word by word, and sometimes even sub-word |
|---|
| 0:00:56 | but we try to process it as soon as the information is available |
|---|
| 0:00:59 | so incrementality helps us model different natural |
|---|
| 0:01:04 | dialogue phenomena such as rapid turn-taking |
|---|
| 0:01:08 | speech overlaps, barge-ins, and backchannels, so modeling these is very important to make |
|---|
| 0:01:14 | dialogue systems more natural and efficient |
|---|
| 0:01:18 | so the contributions of this work |
|---|
| 0:01:21 | can be kind of grouped into three points |
|---|
| 0:01:24 | the first one is we provide a reinforcement learning method to model incrementality |
|---|
| 0:01:29 | the second one is we provide a detailed comparison with a strong baseline, and what does that mean |
|---|
| 0:01:34 | so in our previous work we have a very |
|---|
| 0:01:37 | you know, like, state-of-the-art carefully designed rule-based baseline system |
|---|
| 0:01:41 | which interacts with humans in real time |
|---|
| 0:01:44 | and it performs nearly as well as humans, right, so it's a really |
|---|
| 0:01:48 | strong baseline |
|---|
| 0:01:50 | i'll show the actual videos and you'll get more context in the slides to come |
|---|
| 0:01:55 | the reinforcement learning model introduced here actually outperforms |
|---|
| 0:02:00 | the CDR baseline, but please keep in mind it's offline, we don't have |
|---|
| 0:02:04 | a real-time system yet, but it still outperforms |
|---|
| 0:02:08 | and we also provide some analysis of the development time it took to develop |
|---|
| 0:02:12 | each of these policies |
|---|
| 0:02:13 | so the selected domain: the domain is a rapid dialogue game, we call it RDG-Image |
|---|
| 0:02:19 | it's a two-player game |
|---|
| 0:02:22 | it's an image-matching collaborative game between two players, so each person |
|---|
| 0:02:26 | is assigned a role, either the director or the matcher |
|---|
| 0:02:29 | the director sees the eight images as you see on the screen here, one of |
|---|
| 0:02:34 | them is highlighted with a red border, and the director is supposed to describe this |
|---|
| 0:02:38 | one |
|---|
| 0:02:38 | the matcher sees the same eight images in a different order, and the matcher is |
|---|
| 0:02:42 | supposed to make the selection based on the description given |
|---|
| 0:02:45 | right, and the goal is to get |
|---|
| 0:02:47 | as many matches as possible in the allotted time |
|---|
| 0:02:50 | so that's why, you know, the game is fast and incremental |
|---|
| 0:02:55 | so let's look at an example of how this game works |
|---|
| 0:03:01 | so here you see two players, humans playing with one another; the person on |
|---|
| 0:03:06 | the top is the director, so they're seeing one of the images highlighted with |
|---|
| 0:03:10 | the red border and should describe the highlighted image; the other person below is the matcher |
|---|
| 0:03:15 | who should try to guess the image based on the description |
|---|
| 0:03:18 | and there's a timer and a score |
|---|
| 0:03:24 | [video of human-human gameplay plays; audio partly unintelligible] |
|---|
| 0:03:38 | okay, so in this particular game, as you can see |
|---|
| 0:03:43 | the dialogue is very fast and incremental, there's a lot of rapid turn-taking |
|---|
| 0:03:46 | so it's a fast-paced game and it's fun |
|---|
| 0:03:52 | so we collected a lot of data from the human-human conversations, and then |
|---|
| 0:03:56 | we designed an incremental agent |
|---|
| 0:04:00 | so we built Eve |
|---|
| 0:04:02 | she's a high-performance baseline system, right |
|---|
| 0:04:05 | so she's trained on the human conversation data, and i'll provide more details in the coming |
|---|
| 0:04:10 | slides |
|---|
| 0:04:11 | we evaluated this with one hundred twenty five users |
|---|
| 0:04:14 | and she performs nearly as well as the humans |
|---|
| 0:04:18 | so let's see her at work |
|---|
| 0:04:19 | so this video will show you how the interaction between Eve and the human goes |
|---|
| 0:04:24 | this is them playing the game, right |
|---|
| 0:04:26 | so at the top |
|---|
| 0:04:27 | you will see the eight images that the human sees, and on the bottom |
|---|
| 0:04:31 | you see Eve's |
|---|
| 0:04:32 | eight images, and you'll see green bars going up and down, which is basically her confidence |
|---|
| 0:04:38 | in each image, and they're changing based on the human's descriptions |
|---|
| 0:04:45 | [demo video of Eve playing with a human: the human describes images such as "it's a yellow bird", "a sleeping black and white cat", and "a bike with handlebars", and Eve responds "got it" or asks which one; audio partly unintelligible] |
|---|
| 0:05:20 | alright, so that's Eve just playing the game with humans, it's real time, and she |
|---|
| 0:05:24 | is a pretty strong agent to begin with |
|---|
| 0:05:26 | so |
|---|
| 0:05:28 | how does she work? so basically we have the user's speech coming in and we have |
|---|
| 0:05:32 | an incremental ASR, which is Kaldi |
|---|
| 0:05:34 | which is providing a one-best hypothesis every hundred milliseconds |
|---|
| 0:05:39 | we use this hypothesis and we compute a confidence distribution over |
|---|
| 0:05:46 | all the eight images that are on the screen |
|---|
| 0:05:49 | and then the dialogue policy uses this distribution and decides whether to wait |
|---|
| 0:05:56 | or to select or to skip |
|---|
| 0:05:58 | the wait action is |
|---|
| 0:05:59 | she stays silent, she keeps listening |
|---|
| 0:06:01 | select is when |
|---|
| 0:06:03 | she has enough confidence to, you know, make the selection, and skip is when |
|---|
| 0:06:07 | she's thinking, hey, i'm not getting much, you know, information, maybe i'll just |
|---|
| 0:06:11 | skip and go to the next one and hope i get that one right |
|---|
| 0:06:15 | the natural language generation is very simple, it's template-based: if she decides to select |
|---|
| 0:06:19 | she says "got it", as you heard in the video |
|---|
| 0:06:21 | if she's skipping, she says something like "let's skip that one" |
|---|
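As a minimal sketch of the loop just described; the `asr`, `nlu`, and `policy` objects and their method names below are assumptions for illustration, not the system's actual interfaces.

```python
# Sketch of Eve's incremental decision loop (hypothetical interfaces).
# Every 100 ms: take the ASR 1-best partial, score the 8 candidate images,
# then let the policy choose WAIT, SELECT, or SKIP.
from enum import Enum

class Action(Enum):
    WAIT = 0
    SELECT = 1
    SKIP = 2

def run_subdialogue(asr, nlu, policy, images, step_ms=100):
    elapsed = 0.0
    for partial in asr.partials(step_ms):          # 1-best hypothesis every 100 ms
        elapsed += step_ms / 1000.0
        confidences = nlu.score(partial, images)   # list of 8 confidences
        action = policy.decide(max(confidences), elapsed)
        if action == Action.SELECT:
            return "got it", confidences.index(max(confidences))
        if action == Action.SKIP:
            return "let's skip that one", None
        # Action.WAIT: stay silent and keep listening
    return "let's skip that one", None              # time ran out
```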
| 0:06:25 | now the focus of this work is the dialogue policy. so the dialogue policy in |
|---|
| 0:06:30 | the previous work, Eve's, the task was to design rules |
|---|
| 0:06:33 | and i'll explain why we call them carefully designed rules in a minute |
|---|
| 0:06:39 | we thought, can we do better |
|---|
| 0:06:41 | than the current baseline |
|---|
| 0:06:42 | right |
|---|
| 0:06:43 | so we use reinforcement learning and we try to see if it can perform better |
|---|
| 0:06:50 | so |
|---|
| 0:06:52 | the carefully designed rules baseline has these things: P*, P-star, is the first |
|---|
| 0:06:55 | one, which is the highest |
|---|
| 0:06:57 | probability assigned |
|---|
| 0:06:58 | by the NLU |
|---|
| 0:07:01 | to any one of the eight images |
|---|
| 0:07:02 | and then there are two values, which are the identification threshold and the give-up threshold. so |
|---|
| 0:07:07 | the identification threshold is |
|---|
| 0:07:09 | the minimum confidence that should be hit for any given image, above which Eve can |
|---|
| 0:07:15 | say "got it" |
|---|
| 0:07:17 | and the give-up threshold is |
|---|
| 0:07:19 | the maximum time she waits, after which |
|---|
| 0:07:23 | she'll say skip. so any time in between she's waiting, right |
|---|
| 0:07:27 | so this is what the carefully designed rules baseline system is |
|---|
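Roughly, the rule being described is: select when the top confidence clears the identification threshold, skip when the elapsed time passes the give-up threshold, otherwise wait. A sketch with placeholder thresholds (the real IT and GT were tuned from data, not these values):

```python
# Sketch of the carefully-designed-rules (CDR) policy described above.
IT = 0.8   # identification threshold (placeholder value)
GT = 10.0  # give-up threshold in seconds (placeholder value)

def cdr_policy(p_star, elapsed_s):
    """p_star: highest NLU confidence over the 8 images; elapsed_s: time used."""
    if p_star >= IT:
        return "SELECT"   # confident enough to say "got it"
    if elapsed_s >= GT:
        return "SKIP"     # waited too long, move on
    return "WAIT"         # keep listening
```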
| 0:07:33 | why do we call it carefully designed rules, right |
|---|
| 0:07:37 | so in published comparisons of learned policies and rule-based ones, one thing that |
|---|
| 0:07:42 | is often not clear is how much time was actually spent |
|---|
| 0:07:45 | on designing the rule-based systems |
|---|
| 0:07:48 | so in this work |
|---|
| 0:07:49 | you know, we report that. so |
|---|
| 0:07:52 | the identification threshold IT and the give-up threshold |
|---|
| 0:07:55 | GT are actually |
|---|
| 0:07:58 | not some random values that we picked, they're actually tuned from the |
|---|
| 0:08:03 | human-human conversation data |
|---|
| 0:08:06 | we use something called the eavesdropper framework, and we use this to get the ITs and |
|---|
| 0:08:10 | GTs |
|---|
| 0:08:11 | for more details please refer to our paper, that is in SIGDIAL |
|---|
| 0:08:15 | fifteen |
|---|
| 0:08:16 | so we spent almost one month, you know, trying to find |
|---|
| 0:08:20 | what's the best way to design these policies |
|---|
| 0:08:23 | predicting the next word is one such example we explored. so |
|---|
| 0:08:28 | it looks something like this: we designed these rules, and it actually performs nearly |
|---|
| 0:08:32 | as well as humans, so it's a really strong baseline |
|---|
| 0:08:35 | but even though it has carefully designed rules, she still has |
|---|
| 0:08:39 | you know, a few limitations, so we've grouped them as case one, case two, and case three |
|---|
| 0:08:46 | so in this |
|---|
| 0:08:47 | in this particular slide, what you see on the x-axis is time, right, as the game goes on |
|---|
| 0:08:53 | the y-axis is the confidence |
|---|
| 0:08:54 | assigned by the NLU |
|---|
| 0:08:57 | so each one of the points is a partial that's coming in |
|---|
| 0:09:03 | from the ASR, so the confidence is actually changing |
|---|
| 0:09:06 | in case one |
|---|
| 0:09:08 | Eve is very eager to skip |
|---|
| 0:09:10 | right |
|---|
| 0:09:11 | in case two she's very eager to select |
|---|
| 0:09:14 | so sometimes what happens with partials from incremental speech recognition is that we have a lot |
|---|
| 0:09:19 | of, you know, unstable |
|---|
| 0:09:22 | hypotheses, and that often leads to these kinds of spikes in the confidence |
|---|
| 0:09:27 | and in case three, Eve could |
|---|
| 0:09:29 | actually save time |
|---|
| 0:09:30 | by maybe selecting, you know, much earlier, right |
|---|
| 0:09:34 | so these are three cases where Eve could actually perform better |
|---|
| 0:09:39 | so we use reinforcement learning |
|---|
| 0:09:41 | so the state space |
|---|
| 0:09:43 | is represented by a tuple, that is P*, which is the highest confidence |
|---|
| 0:09:48 | in any one of the eight images |
|---|
| 0:09:51 | and then T_c, which is the time consumed |
|---|
| 0:09:53 | right, so that captures what has happened so far |
|---|
| 0:09:56 | the action is basically: is it select |
|---|
| 0:09:59 | is it skip, or is it wait |
|---|
| 0:10:02 | then there are the transition probabilities and a discount factor, and the reward is very simplistic |
|---|
| 0:10:07 | that is, if Eve gets |
|---|
| 0:10:09 | the image right |
|---|
| 0:10:10 | she gets a reward of plus one hundred; if she gets it wrong, it's negative |
|---|
| 0:10:13 | one hundred |
|---|
| 0:10:14 | and the wait reward is like a very small epsilon value |
|---|
| 0:10:18 | it's very close to zero |
|---|
| 0:10:19 | and she gets a bit more reward for skipping |
|---|
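As an illustration of the reward scheme just described; the exact wait and skip values below are placeholders standing in for the small epsilon and the slightly larger skip reward mentioned in the talk.

```python
# Sketch of the reward described above (placeholder epsilon/skip values).
CORRECT_REWARD = 100.0
WRONG_PENALTY  = -100.0
WAIT_REWARD    = 1e-3    # "very small epsilon, close to zero"
SKIP_REWARD    = 1e-2    # slightly more reward than waiting

def reward(action, selected_image=None, target_image=None):
    if action == "WAIT":
        return WAIT_REWARD
    if action == "SKIP":
        return SKIP_REWARD
    # action == "SELECT": compare against the ground-truth target image
    return CORRECT_REWARD if selected_image == target_image else WRONG_PENALTY
```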
| 0:10:22 | so |
|---|
| 0:10:23 | the data that we use for this experiment |
|---|
| 0:10:26 | comes in three flavours |
|---|
| 0:10:28 | the human-human data that we collected in the lab |
|---|
| 0:10:31 | the human-human web interaction data collected in another experiment |
|---|
| 0:10:35 | and then the Eve interactions with humans |
|---|
| 0:10:39 | the one hundred twenty five users that i was talking about. so there are thirteen thousand |
|---|
| 0:10:43 | well, more than thirteen thousand subdialogues here |
|---|
| 0:10:45 | so we split them up |
|---|
| 0:10:47 | based on the users: ninety percent of the users |
|---|
| 0:10:50 | are used for training and ten percent for testing |
|---|
| 0:10:54 | for reinforcement learning we use LSPI, that is least-squares policy iteration, and |
|---|
| 0:10:58 | we use Gaussian |
|---|
| 0:11:00 | radial basis functions for representing the features |
|---|
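For intuition, a Gaussian radial-basis-function featurization of the two-dimensional state (P*, time consumed) might look like the sketch below; the grid of centers and the width are assumptions, not the configuration used in this work.

```python
# Sketch of Gaussian RBF features over the state (p_star, time_consumed).
# The centers and width below are illustrative assumptions.
import numpy as np

def rbf_features(p_star, t_consumed, t_max=15.0, width=0.25, grid=5):
    """Return RBF activations for a state, plus a bias term."""
    # Normalize both state dimensions to [0, 1] and lay centers on a grid.
    s = np.array([p_star, t_consumed / t_max])
    cs = np.linspace(0.0, 1.0, grid)
    centers = np.array([[a, b] for a in cs for b in cs])
    acts = np.exp(-np.sum((centers - s) ** 2, axis=1) / (2 * width ** 2))
    return np.concatenate(([1.0], acts))  # bias + 25 RBF activations

# In LSPI, one weight vector per action is fit over features like these,
# and the policy picks the action with the highest estimated Q-value.
```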
| 0:11:04 | so how does it operate |
|---|
| 0:11:06 | so every hundred milliseconds the ASR is giving out the partials |
|---|
| 0:11:11 | P* is assigned by the NLU |
|---|
| 0:11:14 | and the policy is deciding whether the action is wait |
|---|
| 0:11:16 | or select or skip |
|---|
| 0:11:18 | if it's wait |
|---|
| 0:11:20 | the next thing is we simply sample the next time step, that is, the two |
|---|
| 0:11:24 | hundred millisecond mark, what happened after that |
|---|
| 0:11:26 | that is basically |
|---|
| 0:11:28 | the new value for P* and the time, and the new policy decision |
|---|
| 0:11:32 | so this keeps happening until we see a selection or a skip. if it selects |
|---|
| 0:11:37 | or skips |
|---|
| 0:11:38 | we know the ground truth, so we can assign the reward values |
|---|
| 0:11:42 | based on that |
|---|
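A minimal sketch of this corpus-driven rollout, assuming each recorded subdialogue is stored as (P*, elapsed time, best image) snapshots at 100 ms steps together with the ground-truth target image; the helper names are hypothetical.

```python
# Sketch of replaying one recorded subdialogue as a training episode.
# Each step is a (p_star, elapsed_s, best_image) snapshot taken every 100 ms
# from the logged ASR partials and NLU confidences.

def replay_subdialogue(steps, target_image, policy, reward_fn):
    transitions = []
    for p_star, elapsed_s, best_image in steps:
        action = policy.decide(p_star, elapsed_s)
        if action == "WAIT":
            transitions.append(((p_star, elapsed_s), action, reward_fn("WAIT")))
            continue  # move on to the next 100 ms snapshot from the corpus
        # SELECT or SKIP ends the subdialogue; ground truth gives the reward.
        r = reward_fn(action, best_image, target_image)
        transitions.append(((p_star, elapsed_s), action, r))
        break
    # Fed to LSPI (together with next-state information) as training samples.
    return transitions
```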
| 0:11:44 | so this is a snapshot of how things look, right |
|---|
| 0:11:47 | so on the x-axis you see the partials |
|---|
| 0:11:49 | so each one of those points is a partial that's coming in |
|---|
| 0:11:52 | from the ASR |
|---|
| 0:11:54 | and on the y-axis you see the confidence assigned by the NLU |
|---|
| 0:11:59 | so in this example you see the baseline agent is skipping at this point |
|---|
| 0:12:04 | and then the RL agent, in contrast, is actually waiting for a longer time until she |
|---|
| 0:12:09 | sees a very high confidence, and hence she gets the image right |
|---|
| 0:12:14 | okay, so i want to take |
|---|
| 0:12:15 | a little time to explain this graph |
|---|
| 0:12:17 | it's not so intuitive |
|---|
| 0:12:21 | so on the horizontal axis you see three groups |
|---|
| 0:12:24 | on the left you see wait actions |
|---|
| 0:12:26 | in the middle you see skip actions |
|---|
| 0:12:28 | and on the right you see the select actions, right |
|---|
| 0:12:31 | so this graph shows the complete state space, everything in the state |
|---|
| 0:12:36 | the red dots |
|---|
| 0:12:37 | indicate the baseline agent's decisions |
|---|
| 0:12:40 | the blue dots are what was learned by the reinforcement learning policy |
|---|
| 0:12:45 | on the vertical axis you see the time going from zero to fifteen |
|---|
| 0:12:50 | and on the other axis you see the confidence going from zero to one |
|---|
| 0:12:56 | so |
|---|
| 0:12:57 | the red dots, you can see that they actually fit together, you know, it's a |
|---|
| 0:13:00 | rule-based system, right, so we can deterministically know, you know |
|---|
| 0:13:05 | what action the agent is taking |
|---|
| 0:13:07 | the blue dots are the actions learned by the reinforcement learning policy, so she's kind of |
|---|
| 0:13:14 | learning similar things, but there are some differences |
|---|
| 0:13:17 | that is, the reinforcement learning policy is learning to actually select an image for very high confidence |
|---|
| 0:13:23 | for extremely high confidence, that is one point zero |
|---|
| 0:13:26 | if the time consumed is low |
|---|
| 0:13:27 | and if the time consumed is not so low |
|---|
| 0:13:31 | she's actually learning to wait more |
|---|
| 0:13:34 | so by waiting more she's actually getting more partials, that is, you know, she's getting |
|---|
| 0:13:38 | more words, as a result of which she has a chance of performing better |
|---|
| 0:13:42 | in the game and hence scoring more points |
|---|
| 0:13:44 | so this graph is kind of showing that |
|---|
| 0:13:49 | this one is simpler |
|---|
| 0:13:52 | so on the x-axis you see the average points scored for each of the image subsets |
|---|
| 0:13:56 | and on the y-axis you see the time consumed |
|---|
| 0:13:59 | so the blue one is the reinforcement learning agent, the red one is the baseline agent |
|---|
| 0:14:03 | so you see the RL agent is actually waiting for a longer time |
|---|
| 0:14:07 | and she's scoring more points, whereas the baseline system, you know |
|---|
| 0:14:11 | is in more of a hurry to, you know, skip or, you know |
|---|
| 0:14:16 | like, make a selection |
|---|
| 0:14:17 | so here we have the RL agent significantly scoring more |
|---|
| 0:14:22 | points than the baseline |
|---|
| 0:14:24 | and |
|---|
| 0:14:25 | there's a trend in how she's performing: she's taking more time to make the selections |
|---|
| 0:14:31 | so why |
|---|
| 0:14:32 | couldn't the CDR baseline learn what the reinforcement learning agent learned, right |
|---|
| 0:14:38 | so |
|---|
| 0:14:39 | you see, if you go back to the policy that we |
|---|
| 0:14:42 | used for the CDR baseline |
|---|
| 0:14:45 | it actually treats the time and the confidence value P* |
|---|
| 0:14:50 | you know, independent of each other |
|---|
| 0:14:53 | but what reinforcement learning is doing is it's actually learning to optimize |
|---|
| 0:14:58 | the policy based on P* and the time consumed jointly |
|---|
| 0:15:03 | and that results in |
|---|
| 0:15:05 | the reinforcement learning agent performing much better than the baseline agent |
|---|
| 0:15:12 | so this table shows, like |
|---|
| 0:15:14 | you know, how many points |
|---|
| 0:15:16 | she scores |
|---|
| 0:15:17 | and PPS is the points per second |
|---|
| 0:15:20 | like, it kind of combines both the points and the time, you know, aspect |
|---|
| 0:15:24 | in one particular |
|---|
| 0:15:25 | table |
|---|
| 0:15:26 | so you can see the RL agent is consistently, you know, scoring much higher in terms |
|---|
| 0:15:31 | of points, you know, across all the image sets |
|---|
| 0:15:34 | but the points per second is something that, you know, is of |
|---|
| 0:15:38 | interest here |
|---|
| 0:15:40 | and it varies |
|---|
| 0:15:42 | so in the baseline's case the points per second is zero point zero nine and |
|---|
| 0:15:46 | in the RL's it's zero point one four, that means |
|---|
| 0:15:49 | by scoring more points |
|---|
| 0:15:51 | she's actually doing better in the game |
|---|
| 0:15:53 | because her points per second is also a lot higher |
|---|
| 0:15:56 | and in the necklaces |
|---|
| 0:15:58 | subset |
|---|
| 0:15:59 | we see that even though the baseline agent has scored much fewer points |
|---|
| 0:16:03 | its points per second is very high |
|---|
| 0:16:05 | that's because |
|---|
| 0:16:06 | the baseline agent is very eager, and, you know, it just won some points by |
|---|
| 0:16:10 | chance |
|---|
| 0:16:12 | but RL is getting more points |
|---|
| 0:16:15 | basically by waiting more, as a result of which her PPS is lower |
|---|
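To spell the metric out: points per second is just points divided by elapsed time, so an eager agent can have a higher PPS even while scoring fewer total points. A toy illustration with made-up numbers, not the numbers from this work:

```python
# Toy illustration of points per second (made-up numbers).
def pps(points, seconds):
    return points / seconds

# An eager agent can have a higher PPS even while scoring fewer total points:
eager   = pps(points=4, seconds=20)   # 0.20 pts/sec, few points, little time used
patient = pps(points=9, seconds=60)   # 0.15 pts/sec, more points, more time used
```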
| 0:16:20 | so i want to discuss a little bit about the effort and |
|---|
| 0:16:24 | the time |
|---|
| 0:16:25 | so rule-based systems are often criticised as being laborious and time-consuming to build |
|---|
| 0:16:31 | they are, but they can actually perform nearly as well as humans, so i |
|---|
| 0:16:37 | don't know if it's fair criticism |
|---|
| 0:16:40 | and, you know, it took nearly the same time to build the CDR baseline as |
|---|
| 0:16:44 | the reinforcement learning policy, and this is of course excluding |
|---|
| 0:16:47 | the data collection and the infrastructure building efforts |
|---|
| 0:16:51 | but |
|---|
| 0:16:51 | the advantage that we get is that the RL approach is more scalable |
|---|
| 0:16:56 | because adding features is easier |
|---|
| 0:17:03 | so for the future work, we want |
|---|
| 0:17:05 | to |
|---|
| 0:17:06 | actually investigate whether these improvements transfer to live interactions, which means we want to |
|---|
| 0:17:12 | you know, like |
|---|
| 0:17:13 | put the reinforcement learning policy into the agent and see if it |
|---|
| 0:17:17 | can actually perform better |
|---|
| 0:17:19 | in a real user study |
|---|
| 0:17:21 | and then we want to explore adding more features to the state space, and |
|---|
| 0:17:25 | then |
|---|
| 0:17:26 | the reward function could also be learned from the data using inverse |
|---|
| 0:17:30 | reinforcement learning |
|---|
| 0:17:33 | and finally i want to thank [name unclear] and |
|---|
| 0:17:38 | the anonymous reviewers for their very useful comments, and NSF and [sponsor unclear] for supporting this |
|---|
| 0:17:43 | work |
|---|
| 0:17:43 | and these people for providing the images that we are using in this particular paper |
|---|
| 0:17:50 | thank you very much. so, any questions |
|---|
| 0:17:57 | [session chair] thank you very much, and now it's time for questions |
|---|
| 0:18:01 | okay, over here, so |
|---|
| 0:18:06 | [audience] thank you very much for a nice talk, and just a clarification question regarding |
|---|
| 0:18:11 | your reinforcement learning setup: if i'm correct, you're learning from a corpus |
|---|
| 0:18:17 | right? [speaker] yep. [audience] but here you're using least-squares policy iteration |
|---|
| 0:18:22 | but this is an on-policy method which requires learning from interaction, right, and you're learning from a corpus |
|---|
| 0:18:31 | [speaker] alright, so |
|---|
| 0:18:33 | so this is one way we explain it: we kind of treat this as |
|---|
| 0:18:38 | a real interaction |
|---|
| 0:18:39 | that is, even though it's, you know, offline |
|---|
| 0:18:42 | for every hundred milliseconds, as it would happen in a real interaction with a user |
|---|
| 0:18:47 | per subdialogue |
|---|
| 0:18:49 | we kind of sample based on each time step, right, for the first |
|---|
| 0:18:54 | hundred milliseconds we have a partial, and for the first partial we have |
|---|
| 0:18:57 | the probability distribution that we find, and we have the time consumed |
|---|
| 0:19:01 | so here we just use the probability distribution and the time |
|---|
| 0:19:04 | you know, like, as the features |
|---|
| 0:19:07 | and then |
|---|
| 0:19:08 | for the next time step, as if it happened in a real interaction, the next thing that's happening |
|---|
| 0:19:13 | is |
|---|
| 0:19:13 | the next partial is coming in |
|---|
| 0:19:15 | and the next partial, you know, is something that the user has |
|---|
| 0:19:18 | actually spoken, you know, in the data that we collected |
|---|
| 0:19:23 | and |
|---|
| 0:19:24 | you know, it keeps going on, so basically it's trained |
|---|
| 0:19:27 | per subdialogue |
|---|
| 0:19:29 | per image |
|---|
| 0:19:32 | [audience] but i still think you would get an improvement if you actually used something like importance |
|---|
| 0:19:37 | sampling |
|---|
| 0:19:38 | to account for the fact that you're re-using a trajectory that happened in the corpus |
|---|
| 0:19:46 | rather than an online exploration method, which on-policy reinforcement learning expects |
|---|
| 0:19:55 | [speaker] |
|---|
| 0:19:58 | that's a good question, i mean, i haven't explored that, but |
|---|
| 0:20:02 | i guess that's, you know, something to explore |
|---|
| 0:20:06 | going forward |
|---|
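For context, the off-policy correction being suggested would weight each logged transition by the ratio of the learned policy's probability to the behavior policy's probability for the logged action; a generic sketch, not something from this work:

```python
# Generic per-step importance weight for off-policy corrections (illustrative).
# pi_target: probability the learned policy assigns to the logged action in a state;
# pi_behavior: probability of that action under whatever generated the corpus.
def importance_weight(pi_target, pi_behavior, eps=1e-8):
    return pi_target / max(pi_behavior, eps)
```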
| 0:20:20 | [audience] thanks for the talk, i have two questions. first one: can you explain a little bit more |
|---|
| 0:20:23 | how you, you know, how you work with image recognition, are you using |
|---|
| 0:20:28 | some CNN model |
|---|
| 0:20:30 | [speaker] we fake the vision |
|---|
| 0:20:31 | [audience] what, you fake the vision |
|---|
| 0:20:33 | [speaker] okay, so the NLU assigns |
|---|
| 0:20:35 | so the way the NLU is trained is |
|---|
| 0:20:37 | we have the human data that we have collected, where humans are actually describing the |
|---|
| 0:20:42 | images in the corpus, right |
|---|
| 0:20:43 | so we had the descriptions from the human games where two humans were speaking and |
|---|
| 0:20:49 | they were describing the target image |
|---|
| 0:20:50 | so we had |
|---|
| 0:20:51 | the words associated with each image |
|---|
| 0:20:54 | so we |
|---|
| 0:20:55 | that's, i mean, that's something that we really want to do, that is |
|---|
| 0:20:59 | you know, like, use new images and then |
|---|
| 0:21:03 | learn from the image rather than faking the vision, but in this particular work |
|---|
| 0:21:07 | we're just learning from |
|---|
| 0:21:08 | the text. [audience] and did you play around with setting, like, the reward for wait |
|---|
| 0:21:12 | to actually be negative, so you might speed up the agent? [speaker] so we tried, like |
|---|
| 0:21:17 | a lot of different things. so one thing is, we didn't start with LSPI |
|---|
| 0:21:22 | in the beginning, like, at the start we tried, you know, all the different algorithms |
|---|
| 0:21:25 | like |
|---|
| 0:21:27 | one of the examples is we tried q-learning, but, you know, it needed a |
|---|
| 0:21:31 | lot more samples. and we did actually try a negative reward |
|---|
| 0:21:39 | for the wait actions, but that would mean the agent is actually |
|---|
| 0:21:44 | you know, like, penalized for waiting, but we don't really want that, right, we want the agent |
|---|
| 0:21:48 | to be assigned higher rewards for doing well in the game rather than waiting |
|---|
| 0:21:53 | or, you know, like |
|---|
| 0:21:54 | doing that specific reward function manipulation |
|---|
| 0:21:58 | i mean |
|---|
| 0:21:58 | the reward function is kind of |
|---|
| 0:22:00 | reflective of what's happening in the game |
|---|
| 0:22:03 | more points for doing well |
|---|
| 0:22:08 | [audience] and related to that, i just want to ask: have you tried switching the roles of |
|---|
| 0:22:13 | the human and the machine in the game, like, what would happen if |
|---|
| 0:22:17 | the machine has to describe the images? [speaker] so currently |
|---|
| 0:22:25 | the agent is only in the matcher role |
|---|
| 0:22:28 | it's not playing the role of the director. it becomes much more complex because we |
|---|
| 0:22:32 | have to incrementally generate the descriptions, but that's something that we really want to do |
|---|
| 0:22:36 | you know, like |
|---|
| 0:22:37 | in the future work |
|---|
| 0:22:39 | we don't know how yet |
|---|
| 0:22:44 | [audience] thanks, very nice talk. just a quick question about the state representation, so |
|---|
| 0:22:48 | purely from the description, are you putting the partials in the state |
|---|
| 0:22:53 | [speaker] yes |
|---|
| 0:22:56 | the partials |
|---|
| 0:22:57 | [audience] no, does the state just have those two values, you know, just to get an idea. okay, so |
|---|
| 0:23:01 | you're not, you're not, like, modeling instability |
|---|
| 0:23:03 | [speaker] right. [audience] okay, so that's not being captured. okay, the partials, like, one might say |
|---|
| 0:23:07 | "bicycle" but it's not, it's, like, you know, "bee" or something like that, so |
|---|
| 0:23:10 | it could be faster if you put that in, you know, because the confidence wasn't |
|---|
| 0:23:14 | stable, you know, in the partials, right, even this instability, you could learn |
|---|
| 0:23:17 | the most because the recognition is inconsistent |
|---|
| 0:23:22 | [speaker] that is right. i want to show one small thing: with the instability, in |
|---|
| 0:23:27 | case two, you know, Eve scores a point later because of |
|---|
| 0:23:32 | this instability |
|---|
| 0:23:33 | and what we actually want, and what actually happens in the game, is |
|---|
| 0:23:40 | that, you know, the NLU confidence |
|---|
| 0:23:42 | is actually, you know, fluctuating |
|---|
| 0:23:45 | right, and the NLU confidence, you know, has all these blips |
|---|
| 0:23:49 | and these things |
|---|
| 0:23:50 | you know, it's kind of |
|---|
| 0:23:52 | learned as a joint function of, like, the probability and the time, so she's actually waiting |
|---|
| 0:23:56 | a lot more, giving a chance for it to settle. but that's a good |
|---|
| 0:24:00 | question. [audience] i mean, i think if you had more |
|---|
| 0:24:02 | information in the state, the learning would probably be more successful, 'cause i think that |
|---|
| 0:24:06 | it's possible you're maybe violating the MDP assumptions a little bit |
|---|
| 0:24:11 | [speaker] so |
|---|
| 0:24:13 | adding more features to the state, that's something we want to do |
|---|
| 0:24:20 | right |
|---|
| 0:24:20 | [session chair] thank you, thank you. i think we want to thank the speaker once again |
|---|