Hi everyone, I'm Abhishek, and I'll be presenting our work on Embodied Question Answering. This is joint work with my collaborators at Georgia Tech and Facebook AI Research.

So in this work we propose a new task called Embodied Question Answering. The setup is that there's an agent that's spawned at a random location in an unseen environment and asked a question, such as "what color is the car?" In order to succeed, the agent must understand the question, navigate the environment, find the object the question asks about, and respond back with the answer.

So we begin by proposing a dataset of questions and environments for this task. For environments we use House3D, which is work out of Facebook AI Research on building a rich and interactive environment out of the SUNCG dataset.

And to give a sense of what this data looks like, here are a few examples from House3D: a few living rooms, and here are a few bathrooms. As you can see, there's a wide and diverse set of colors, textures, objects, and their spatial configurations.

So in total we use 800 environments from House3D for this work, consisting of 12 room types and 50 object types, and we make sure that there's no overlap between the training, validation, and test environments, so we strictly check for generalization to novel environments.

Coming to questions: our questions are generated programmatically, in a manner similar to CLEVR, in that we have a set of primitive functions that can be combined and executed on these environments to generate a whole bunch of questions. To give an example, executing a select-objects primitive on an environment returns a list of objects present, and then passing that list through a singleton filter keeps only the objects that occur exactly once.

We can then query the location for each object in that set to generate a whole bunch of location questions, such as "what room is the piano located in?", "what room is the dog located in?", "what room is the cutting board located in?", and so on.
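The generation pipeline just described, selecting objects, filtering to those that occur exactly once, and filling a template per object, can be sketched in plain Python. The function names and the toy environment below are illustrative, not the actual dataset code:

```python
from collections import Counter

# Toy environment: a list of (object, room) annotations.
# These names are illustrative, not the actual House3D annotations.
ENV = [
    ("piano", "living room"),
    ("dog", "bedroom"),
    ("cutting board", "kitchen"),
    ("chair", "kitchen"),
    ("chair", "living room"),  # "chair" occurs twice, so it gets filtered out
]

def select_objects(env):
    """Primitive: return all object names present in the environment."""
    return [obj for obj, _ in env]

def singleton(objects):
    """Primitive: keep only objects that occur exactly once."""
    counts = Counter(objects)
    return [o for o in objects if counts[o] == 1]

def query_location(env, obj):
    """Primitive: look up the room an object is in."""
    return next(room for o, room in env if o == obj)

def location_questions(env):
    """Compose the primitives into (question, answer) pairs."""
    return [
        (f"what room is the {obj} located in?", query_location(env, obj))
        for obj in singleton(select_objects(env))
    ]
```

Here `location_questions(ENV)` yields three question-answer pairs; "chair" is dropped because asking about an ambiguous object would have no unique answer.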

Here's another example, where we combine these primitive functions in a different combination to generate a whole bunch of color questions: "what color is the base station in the living room?", "what color is the [object] in the gym?", and so on.

In total we have several question types, but for this initial work we focus on location, color, and template-based preposition questions that ask about a single target object.

Additionally, as a post-processing step, we make sure that the answer distributions for these questions aren't peaky, so that the agent actually has to navigate to be able to answer accurately and cannot exploit biases.
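One simple way to detect a peaky answer distribution is to compute the normalized entropy of the answer counts for a question template and discard templates that fall below a threshold. This is a sketch of the idea, not the exact procedure used in the paper:

```python
import math
from collections import Counter

def normalized_entropy(answers):
    """Entropy of the empirical answer distribution, scaled to [0, 1].

    1.0 means the answers are uniform (no bias to exploit);
    values near 0 mean a single answer dominates.
    """
    counts = Counter(answers)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_h = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h

# A peaky template: a blind agent could just always answer "kitchen".
peaky = ["kitchen"] * 9 + ["bedroom"]
# A balanced template: guessing any single answer rarely works.
balanced = ["kitchen", "bedroom", "bathroom", "living room"] * 5
```

With these toy lists, `normalized_entropy(balanced)` is 1.0 while `normalized_entropy(peaky)` is well under 0.5, which is the kind of gap a filtering threshold would exploit.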

And all this data is publicly available for download at embodiedqa.org.

Coming to our model, it consists of four components: vision, language, navigation, and answering. The vision module is a four-layer convolutional neural network which is pretrained for input reconstruction, semantic segmentation, and depth estimation. Once it's pretrained, we throw away the decoders and just use the encoder as a fixed feature extractor.

The language module is an LSTM that extracts a fixed-size representation of the question.

We have a hierarchical navigation policy, consisting of a planner that selects which action to perform and a controller that decides how many time steps to execute each action for.

And here's what it looks like in practice. We extract image features using the CNN, and conditioned on these image features and the question, the planner decides which action to perform; in this case it decides to turn right. Control is then passed to the controller. The controller then has to decide whether to continue turning right or to return control to the planner; in this case it decides to return control, and that completes one time step of the planner.

Okay, and at the next time step, the planner looks at the image features and the question and decides which action to perform. So here it decides to go forward, control is passed to the controller, and the controller decides to continue moving forward for three time steps before handing control back to the planner. And this sort of continues until finally the planner decides to stop.
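The planner-controller interplay just walked through can be sketched with stub policies. The real model uses learned LSTM policies conditioned on CNN features and the question; here the decisions are hard-coded purely to show the control flow:

```python
def run_episode(planner, controller, max_steps=20):
    """Hierarchical navigation loop: the planner picks an action,
    then the controller repeats it until it yields control back."""
    trajectory = []
    while len(trajectory) < max_steps:
        action = planner(trajectory)       # e.g. "turn-right", "forward", "stop"
        if action == "stop":
            break
        trajectory.append(action)          # the planner's own step counts once
        while len(trajectory) < max_steps and controller(action, trajectory):
            trajectory.append(action)      # the controller continues the action
    return trajectory

# Hard-coded stand-ins for the learned policies, mirroring the talk's example.
def planner(traj):
    plan = ["turn-right", "forward", "stop"]
    # Pick the next planned action based on how many action segments so far.
    segments = sum(1 for i, a in enumerate(traj) if i == 0 or a != traj[i - 1])
    return plan[segments]

def controller(action, traj):
    # Keep "forward" going until 4 consecutive steps; yield control otherwise.
    run = 0
    for a in reversed(traj):
        if a != action:
            break
        run += 1
    return action == "forward" and run < 4
```

Running `run_episode(planner, controller)` reproduces the episode from the slides: one turn-right step (the controller immediately yields), then four forward steps, then stop.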

For answering, we extract a question representation using an LSTM, and we compute attention over the last five image frames from the navigation trajectory. We combine these attended image features with the question representation to make a prediction of the answer.
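The answering step, attending over the last five frames and combining with the question, can be sketched in plain Python with dot-product attention. The real model operates on learned CNN and LSTM features; the frames and question vector here are toy values:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(frames, question):
    """Dot-product attention: weight each frame feature vector by its
    similarity to the question vector, then take the weighted average."""
    scores = [sum(f * q for f, q in zip(frame, question)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * frame[i] for w, frame in zip(weights, frames))
            for i in range(dim)]

# Toy check: the one frame aligned with the question dominates the result.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
question = [5.0, 0.0]
attended = attend(frames, question)
```

In this toy setup the first frame gets almost all of the attention mass, so `attended` stays close to `[1.0, 0.0]`; the attended feature would then be fused with the question representation to score candidate answers.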

Now that we have these four modules, coming to training. As a reminder, the agent is spawned at a random location in an environment; here I'm showing the top-down map. We ask it a question, say "what room is the [object] located in?", and the red star shows the target location, so that's where the agent is expected to navigate. The shortest path might look something like this; here's the first-person video of the shortest path that an expert agent would take.

Given the shortest path, we can train our answering module to predict the answer from the last five frames, and we pretrain our navigation module in a teacher-forcing manner to predict each action along the shortest path.
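Teacher forcing here just means supervising each step with the expert's next action along the shortest path. A minimal cross-entropy sketch, with toy policy outputs standing in for the navigation module:

```python
import math

def teacher_forcing_loss(policy_probs, expert_actions):
    """Mean negative log-likelihood of the expert's shortest-path actions
    under the policy's per-step action distributions.

    policy_probs: one dict of {action: probability} per time step.
    expert_actions: the ground-truth action at each time step.
    """
    nll = 0.0
    for probs, action in zip(policy_probs, expert_actions):
        nll -= math.log(probs[action])
    return nll / len(expert_actions)
```

For example, a policy that puts probability 0.5 on the correct action at every step incurs a loss of ln 2 per step; driving this loss down pushes the policy toward imitating the expert trajectory.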

Once we have these two modules pretrained, we fine-tune using reinforcement learning. We put the agent in an environment, sample actions from the navigation policy, execute those actions in the environment, and assign an intermediate reward whenever it makes progress towards the target. And when the agent chooses to stop, we execute the answering module and assign a terminal reward if the agent gets the answer right.
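The reward structure just described, an intermediate reward for progress toward the target plus a terminal reward for a correct answer, amounts to the following in sketch form. The constants and function names are illustrative, not the exact values from the paper:

```python
def step_reward(prev_dist, new_dist):
    """Intermediate reward: positive when the agent got closer to the target,
    negative when it moved away."""
    return prev_dist - new_dist

def terminal_reward(predicted_answer, true_answer, bonus=5.0):
    """Terminal reward: granted only when the answer is correct."""
    return bonus if predicted_answer == true_answer else 0.0

def episode_return(distances, predicted, true):
    """Total undiscounted return for one episode, given the agent's
    distance to the target at each time step."""
    r = sum(step_reward(d0, d1) for d0, d1 in zip(distances, distances[1:]))
    return r + terminal_reward(predicted, true)
```

Note the shaped part telescopes: summing the per-step distance differences gives exactly (start distance minus final distance), so the agent is ultimately rewarded for net progress plus answer correctness.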

In terms of metrics, again I'm showing the top-down map; the right plot shows what an agent's trajectory might look like. Given an agent's final location, we can evaluate what the final distance to the target is and what the improvement in distance is. We also compute whether the agent enters or ends up in the right room, and whether it ever reaches the target. And for answering, we look at the mean rank of the ground-truth answer in the softmax distribution predicted by the answering module.
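These evaluation quantities are straightforward to compute; here is a sketch with illustrative helper names (the actual evaluation code may differ):

```python
def answer_rank(probs, true_answer):
    """Rank (1 = best) of the ground-truth answer when all candidate
    answers are sorted by predicted probability."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked.index(true_answer) + 1

def mean_rank(predictions):
    """predictions: list of (softmax-dict, ground-truth-answer) pairs."""
    ranks = [answer_rank(p, t) for p, t in predictions]
    return sum(ranks) / len(ranks)

def distance_metrics(start_dist, final_dist):
    """d_T: final distance to the target; d_Delta: improvement over the episode."""
    return {"d_T": final_dist, "d_Delta": start_dist - final_dist}
```

A mean rank of 1.0 would mean the answering module always puts the correct answer first; lower is better for both mean rank and the final distance d_T, while higher is better for d_Delta.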

In terms of results on the distance-to-target metric, lower is better. Here I'm showing a few baselines. First, adding in question information over a prior-based navigation module has the agent end up closer to the target by about half a meter. Adding memory in the form of an LSTM helps it do even better, by about another half a meter. And finally, our hierarchical policy ends up closest to the target.

So here are a few qualitative examples. For the question "what color is the fish tank in the living room?", I'm showing the baseline LSTM model on the left. The baseline model turns, looks at the fish tank, but then walks right out of the house, so it doesn't know where to stop, and it finally gets the answer wrong. Our model, on the other hand, turns, looks at the fish tank, walks up to it, stops, and gets the answer correct.

Here's another example; the question is "what color is the bathtub?" The baseline model turns but gets stuck against a wall, whereas our model walks up to the bathtub, stops, and gets the answer right.

So, to summarize: I introduced the task of Embodied Question Answering, which involves navigation and question answering in these simulated House3D environments. We proposed a dataset for this task, and we proposed a hierarchical navigation policy that outperforms competitive baselines. All of this data and code is publicly available, so I'd urge you to check it out.

That's it, thank you.

So by baking the navigator into your model, you're making an assumption about how the system can navigate and the world you're building in. If you have a LIDAR system or a wheeled system, you can imagine learning very different policies, or likewise if you were in a multi-storey building. I was asking how you might generalize the model, and whether this is the right abstraction to really try to understand how to solve the problem.

I mean, that's a good question; I don't think I have a definitive answer to it right now. We're abstracting away all the details related to what the specific hardware might be, and we are assuming no stochasticity in the environment: we're assuming that executing "forward" will always move the agent forward 0.5 meters.

[inaudible follow-up question]

So the action space would change depending on what specific hardware you have access to. You could imagine training some of these models conditioned on the specific hardware parameters they might have to work with, if we had access to those. But beyond that, I don't have anything concrete right now.

Where do the errors of the model come from: from the perception part, from the language part, or from the navigation?

So, I missed the first part: where do the errors of the model come from? The way the task is set up, the agent has to navigate purely from first-person vision; it doesn't have a map of the environment. I think that's where most of the errors come from: navigating just from first-person vision, even in a simulated environment, is extremely hard to get to work. I skipped those details in this presentation, but they're in the paper.

For evaluation, we evaluate the agent at different difficulty levels: we initially spawn it ten steps back from the target, then thirty, then fifty, and see how well it does. Note that even at the most difficult level it only has to cross about one room, and anything beyond that it doesn't do a really good job at, so I think navigation is the hardest part.