0:00:15hi everyone i'm abhishek i'm a phd student at georgia
0:00:19tech and i'll be presenting our work on embodied question answering this is joint work
0:00:24with my collaborators at georgia tech and facebook ai research
0:00:29so in this work we propose a new task called embodied question answering the task
0:00:33is that there's an agent that's spawned at a random location in an unseen environment and
0:00:38asked a question such as what colour is the car
0:00:41in order to succeed the agent must understand the question navigate the environment find the object
0:00:46that the question asks about and respond back with the answer
0:00:51so we begin by proposing a data set of questions in environments for this task
0:00:56so for environments we use house three d which is work out of facebook
0:01:00ai research in building a rich and interactive environment out of the suncg dataset
0:01:06and so to give a sense of what this data looks like here are a few kitchens
0:01:09from house three d
0:01:14here are a few living rooms
0:01:20and here are a few bedrooms
0:01:23so as you can see there's a rich and diverse set of colours textures objects
0:01:26and their spatial configurations
0:01:29so in total we use eight hundred environments from house three d for this work
0:01:33consisting of twelve room types and fifty object types and we make sure that there's no
0:01:38overlap between the training validation and test environments so we strictly check for generalization to
0:01:42novel environments
0:01:45coming to questions our questions are generated programmatically in a manner similar to clevr in
0:01:49that we have several primitive functions that can be combined and executed on these
0:01:54environments to generate a whole bunch of questions
0:01:58to give an example executing select objects on an environment returns a list of objects present and
0:02:03then passing that list through singleton will filter it to retain objects that
0:02:08occur only once
0:02:10and we can then query the location for each object in that set to generate
0:02:13a whole bunch of location questions such as what room is the piano located in what
0:02:17room is the dog located in what room is the cutting board located in and so on
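
The primitive-function pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' generation engine: the toy environment, the function names `select_objects`, `singleton` and `query_location`, and the question template are all assumptions chosen to mirror the spoken description.

```python
from collections import Counter

# Hypothetical toy environment: a list of (object name, room) pairs.
TOY_ENV = [
    ("piano", "living room"),
    ("dog", "bedroom"),
    ("cutting board", "kitchen"),
    ("chair", "kitchen"),
    ("chair", "living room"),
]


def select_objects(env):
    """Return every object instance present in the environment."""
    return list(env)


def singleton(objects):
    """Keep only objects whose name occurs exactly once."""
    counts = Counter(name for name, _ in objects)
    return [(name, room) for name, room in objects if counts[name] == 1]


def query_location(objects):
    """Turn each remaining object into a (question, answer) pair."""
    return [
        (f"what room is the {name} located in", room)
        for name, room in objects
    ]


def generate_location_questions(env):
    """Compose the three primitives into one question generator."""
    return query_location(singleton(select_objects(env)))


questions = generate_location_questions(TOY_ENV)
for question, answer in questions:
    print(question, "->", answer)
```

The chair is dropped because it appears twice, so a location question about it would be ambiguous.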
0:02:23here's another example when we combine these primitive functions in a different combination to generate
0:02:27a whole bunch of colour questions so what colour is the base station in the living
0:02:30room what colour is the [inaudible] in the gym and so on
0:02:33in total we have several question types but for this initial work we focus on
0:02:38location colour and template based preposition questions that ask about a single
0:02:43target object
0:02:44and additionally as a post-processing step we make sure that the answer distributions for these
0:02:49questions aren't peaky so that the agent actually has to navigate to be able to
0:02:52answer accurately and cannot exploit biases
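
One way to check that an answer distribution is not peaky is to threshold its normalized entropy. The sketch below assumes that approach; the talk does not say which statistic or threshold was used, so `normalized_entropy`, `keep_question` and the value `0.5` are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(answers):
    """Entropy of the empirical answer distribution, scaled to [0, 1]."""
    counts = Counter(answers)
    if len(counts) < 2:
        return 0.0  # a single possible answer carries no uncertainty
    total = len(answers)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def keep_question(answers, threshold=0.5):
    """Keep a question only if its answers are spread out enough.

    The threshold is an assumed value for illustration.
    """
    return normalized_entropy(answers) >= threshold


# Answers to the same question collected across different environments.
peaked = ["kitchen"] * 9 + ["bedroom"]
spread = ["kitchen", "bedroom", "living room", "kitchen"]
print(keep_question(peaked), keep_question(spread))
```

A question whose answer is almost always the same could be answered from the prior alone, which is what the filtering step is meant to prevent.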
0:02:57and all this data is publicly available for download on embodiedqa.org
0:03:01coming to our model it consists of four components vision language navigation and answering
0:03:06our vision module is a four layer convolutional neural network which is pretrained on
0:03:11input reconstruction semantic segmentation and depth estimation
0:03:15once it's pretrained we throw away the decoders and just use the encoder as
0:03:18a fixed feature extractor
0:03:20our language module is an lstm that extracts a fixed size representation of
0:03:25the question
0:03:26we have a hierarchical navigation policy consisting of a planner that picks which action to
0:03:31perform and a controller that decides how many time steps to execute each action for
0:03:36and so here's what it looks like in practice we extract image features using the
0:03:41cnn and conditioned on these image features and the question the planner decides which action
0:03:45to perform so in this case it decides to turn right
0:03:48control is then passed to the controller
0:03:51the controller then has to decide whether to continue turning right or return control to
0:03:55the planner so in this case it decides to return control and that completes one
0:04:00time step of the planner
0:04:01okay and at the next time step the planner looks at the image features and
0:04:04the question and decides which action to perform so here it's forward control is passed
0:04:09to the controller the controller decides to continue moving forward for three time steps before
0:04:13handing back control to the planner
0:04:15and this sort of continues until finally the planner decides to stop
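
The planner and controller hand-off walked through above can be sketched as a simple loop. This is an illustration under assumptions: the planner and controller are stand-in callables rather than trained networks, the planner's action is assumed to be executed once before the controller is consulted, and `run_episode`, `make_scripted_policy` and the action names are hypothetical.

```python
def run_episode(planner, controller, observe, step, max_steps=100):
    """Alternate between a planner (which action) and a controller (repeat it?)."""
    trajectory = []
    while len(trajectory) < max_steps:
        action = planner(observe())
        if action == "stop":
            break
        # Assumption: the planner's action is executed once before the
        # controller is asked whether to keep repeating it.
        step(action)
        trajectory.append(action)
        while len(trajectory) < max_steps and controller(observe(), action):
            step(action)
            trajectory.append(action)
    return trajectory


def make_scripted_policy():
    """Scripted stand-ins for the learned planner and controller."""
    plan = iter(["turn-right", "forward", "stop"])
    extra_repeats = {"turn-right": 0, "forward": 2}
    state = {"left": 0}

    def planner(observation):
        action = next(plan)
        state["left"] = extra_repeats.get(action, 0)
        return action

    def controller(observation, action):
        if state["left"] > 0:
            state["left"] -= 1
            return True  # keep executing the same action
        return False  # hand control back to the planner

    return planner, controller


planner, controller = make_scripted_policy()
trajectory = run_episode(planner, controller, observe=lambda: None, step=lambda a: None)
print(trajectory)
```

With this script the agent turns right once, moves forward for three time steps, and then the planner stops.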
0:04:24for answering we extract a question representation using an lstm and we compute attention over the
0:04:29last five image frames from the navigation trajectory we combine these attended image features with
0:04:34the question representation to make a prediction of the answer
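
Attention over the last five frames can be sketched in plain Python as below. The talk does not specify the scoring function or how the two representations are combined, so the dot-product scores and the concatenation in `answer_input` are assumptions, and the feature vectors are toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features, question_vec):
    """Weight each frame by its (assumed) dot-product similarity to the question."""
    scores = [
        sum(f * q for f, q in zip(frame, question_vec))
        for frame in frame_features
    ]
    weights = softmax(scores)
    dim = len(frame_features[0])
    attended = [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(dim)
    ]
    return weights, attended


def answer_input(frame_features, question_vec, num_frames=5):
    """Attend over the last `num_frames` frames, then append the question vector."""
    _, attended = attend(frame_features[-num_frames:], question_vec)
    return attended + list(question_vec)  # concatenation is an assumption


# Toy trajectory of seven 2-d frame features and a 2-d question vector.
frames = [[float(i), 1.0] for i in range(7)]
question = [0.1, 0.2]
print(answer_input(frames, question))
```

In the full model the resulting vector would feed a classifier over the answer vocabulary.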
0:04:39now that we have these four modules coming to training details as a reminder
0:04:43the agent is spawned at a random location in
0:04:47an environment here i'm showing the top-down map
0:04:50we ask it a question such as what room is the tv set located in the red
0:04:53star shows the location of this tv set so that's where the agent is expected to
0:04:56navigate to a shortest path might look something like this and here's the first person video
0:05:03along that shortest path that this expert agent would see
0:05:07and given the shortest path we can pretrain our answering module to be
0:05:11able to predict the answer from the last five frames
0:05:13and we pretrain our navigation module in a teacher forcing manner to predict each action in
0:05:18the shortest path
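
Teacher forcing on a shortest path can be illustrated by how the training pairs are built: at every step the model is fed the expert's previous action rather than its own prediction. The helper below is a hypothetical sketch; `teacher_forcing_pairs`, the `<start>` token and the string placeholders for frames are assumptions.

```python
def teacher_forcing_pairs(frames, expert_actions, start_token="<start>"):
    """Build ((frame, previous expert action), target action) training pairs.

    frames[t] is the observation at step t and expert_actions[t] is the
    action the shortest-path expert takes at that step.
    """
    if len(frames) != len(expert_actions):
        raise ValueError("need one expert action per frame")
    previous = [start_token] + list(expert_actions[:-1])
    return [
        ((frame, prev), target)
        for frame, prev, target in zip(frames, previous, expert_actions)
    ]


pairs = teacher_forcing_pairs(
    ["frame0", "frame1", "frame2"],
    ["forward", "turn-right", "stop"],
)
for inputs, target in pairs:
    print(inputs, "->", target)
```

Each pair would then be scored with a cross-entropy loss on the target action.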
0:05:20and once we have these two modules pretrained we finetune using reinforcement learning where we spawn the agent
0:05:25in an environment sample actions from this navigation policy execute these actions in the environment
0:05:30and assign an intermediate reward for when it makes progress towards the target
0:05:35and when the agent chooses to stop we execute the answerer
0:05:39and assign a terminal reward if the agent gets the answer right
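
The two reward signals mentioned above can be sketched as follows. The exact reward definitions are not given in the talk, so using the change in distance to target as the intermediate reward and a fixed bonus of `5.0` as the terminal reward are assumptions made for illustration.

```python
def step_reward(prev_dist, new_dist):
    """Intermediate reward: positive when the agent moved closer to the target."""
    return prev_dist - new_dist


def terminal_reward(predicted, truth, bonus=5.0):
    """Terminal reward: paid only for a correct answer (bonus is an assumed value)."""
    return bonus if predicted == truth else 0.0


def episode_rewards(distances, predicted, truth):
    """Per-step rewards for one episode, with the terminal reward appended last."""
    rewards = [
        step_reward(prev, new)
        for prev, new in zip(distances, distances[1:])
    ]
    rewards.append(terminal_reward(predicted, truth))
    return rewards


# Distance to target after each step of a toy episode, in meters.
rewards = episode_rewards([4.0, 3.5, 3.5, 2.0], predicted="kitchen", truth="kitchen")
print(rewards)
```

These rewards would then drive a policy-gradient update of the navigation policy.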
0:05:44in terms of metrics again i'm showing the top-down map so the right plot shows
0:05:49what an agent's trajectory might look like so given an agent's final location we can
0:05:53evaluate what is the final distance to target and what is the improvement in distance we
0:05:58also compute whether the agent ends up in the right room
0:06:02or if it ever chooses to stop or not and for answering we look at
0:06:05the mean rank of the ground truth answer in the softmax distribution predicted by the model
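
The navigation and answering metrics listed above can be computed as in this sketch. The function names and the tie-breaking rule in `rank_of_truth` are assumptions; the quantities themselves (final distance to target, improvement in distance, mean rank of the ground truth answer) are the ones named in the talk.

```python
def navigation_metrics(start_dist, final_dist):
    """Final distance to target and how much closer the agent got."""
    return {
        "final_distance": final_dist,
        "improvement": start_dist - final_dist,
    }


def rank_of_truth(probs, truth_index):
    """1-based rank of the ground truth answer (ties resolved in its favour)."""
    truth_prob = probs[truth_index]
    return 1 + sum(1 for p in probs if p > truth_prob)


def mean_rank(predictions):
    """Average rank over (softmax distribution, ground truth index) pairs."""
    ranks = [rank_of_truth(probs, idx) for probs, idx in predictions]
    return sum(ranks) / len(ranks)


softmax_out = [0.1, 0.6, 0.3]
print(navigation_metrics(start_dist=5.0, final_dist=1.5))
print(mean_rank([(softmax_out, 1), (softmax_out, 2)]))
```

Lower is better for both final distance and mean rank, while a larger improvement is better.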
0:06:10so in terms of results on the distance to target metric lower is
0:06:13better so here i'm showing a few baselines first adding in question information over
0:06:18a purely vision based navigation module helps the agent end up closer to the target by about
0:06:22half a meter adding in memory in the form of an lstm helps it do
0:06:26even better by about half a meter
0:06:28and finally our hierarchical policy ends up closest to the target
0:06:34so here are a few qualitative examples for the question what color is the
0:06:38fish tank in the living room i'm showing the baseline lstm model on the left
0:06:42so the baseline model turns looks at the fish tank but walks right out of
0:06:45the house so it doesn't know when to stop and it finally gets the answer wrong
0:06:51whereas our model turns looks at the fish tank and walks up to it
0:06:54stops and gets it correct
0:06:57here's another example so the question is what colour is the [inaudible]
0:07:00the baseline model turns but gets stuck against a wall
0:07:03whereas our model walks up to the [inaudible] stops and gets the answer right
0:07:08so to summarize i introduced the task of embodied question answering which involves
0:07:12navigation and question answering in these simulated house three d environments we proposed a dataset for
0:07:17this task and we proposed a hierarchical navigation policy that performs reasonably against competitive baselines
0:07:23all of this data and code is publicly available so i encourage you to check
0:07:27that out
0:07:28that's it thank you
0:07:52so by baking the navigator into your model you make an assumption about how
0:07:58the system can navigate in your building
0:08:00if you have a legged system or a wheeled system you can imagine learning very
0:08:05different policies say in a multi storey building do you have a sense of how you might
0:08:10generalize the model and is this the right abstraction to really try to understand
0:08:14how to solve the problem
0:08:18i mean that's a good question i don't think i'm the right person to
0:08:21be answering right now we're abstracting away all the details related to what
0:08:25the specific hardware might be and we are assuming no stochasticity in
0:08:30the environment
0:08:31we are assuming that executing forward will always move point five meters
0:08:37were taken for seven how can we go
0:08:42i mean
0:08:45so the action space will change depending on what specific hardware you have access to
0:08:50you could
0:08:51i could imagine
0:08:53training some of these models
0:08:56conditioned on the specific hardware parameters that they might have to run on
0:09:00if we had access to those
0:09:02but i'd say i don't have anything beyond that
0:09:09i think if it
0:09:11what errors of the model come from the vision part from the language part from
0:09:16the navigation
0:09:17so i missed the first part where do the errors of the model come from
0:09:24the way the task is set up the agent has to navigate purely from first
0:09:27person vision it doesn't have a map of the environment
0:09:29i think that's where most of the errors come from navigating just from first person
0:09:33vision even in the simulated environment is extremely hard to get to work so
0:09:38i skipped those details in this presentation but in the paper we
0:09:42have that
0:09:43for evaluating we evaluate the agent at different difficulty levels where we initially spawn it
0:09:49ten steps from the target then thirty then fifty and see how well it
0:09:53does so
0:09:56even when it's not at the most difficult level and it has to just cross one room or so
0:10:04it doesn't do a really good job so i think navigation is
0:10:07the hardest part