|Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra|
We present a new AI task – Embodied Question Answering (EmbodiedQA) – where an agent is spawned at a random location in a 3D environment and asked a question (‘What color is the car?’). In order to answer, the agent must first intelligently navigate to explore the environment, gather necessary visual information through first-person (egocentric) vision, and answer the question (‘orange’). EmbodiedQA requires a range of AI skills – language understanding, visual recognition, active perception, goal-driven navigation, commonsense reasoning, long-term memory, and grounding language into actions. In this work, we develop a dataset of questions and answers in House3D environments (Wu et al., 2018), evaluation metrics, and a hierarchical model trained with imitation and reinforcement learning for this task.
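The episode structure described above (spawn, observe egocentrically, navigate, answer) can be sketched as a toy simulation. This is a minimal illustrative sketch, not the paper's model or the House3D API: the environment (`ToyEnv`), agent (`ToyAgent`), and 1-D corridor world are all hypothetical stand-ins for the 3D setting.

```python
# Hypothetical sketch of an EmbodiedQA episode: the agent is spawned,
# receives a question, takes navigation actions based on egocentric
# observations, and finally emits an answer. All names are illustrative.

class ToyEnv:
    """A 1-D corridor: the car's color is visible only at cell `goal`."""
    def __init__(self, goal=3, car_color="orange"):
        self.pos = 0          # agent spawn location
        self.goal = goal
        self.car_color = car_color

    def observe(self):
        # Egocentric observation: only what is visible from the current cell.
        return {"car_color": self.car_color} if self.pos == self.goal else {}

    def step(self, action):
        if action == "forward":
            self.pos += 1

class ToyAgent:
    """Navigates until the queried object is visible, then answers."""
    def act(self, question, obs):
        if "car" in question and "car_color" in obs:
            return ("answer", obs["car_color"])
        return ("forward", None)

def run_episode(env, agent, question, max_steps=10):
    for _ in range(max_steps):
        action, answer = agent.act(question, env.observe())
        if action == "answer":
            return answer
        env.step(action)
    return None  # episode ended without an answer

answer = run_episode(ToyEnv(), ToyAgent(), "What color is the car?")
```

In the paper's actual setting, the navigation and answering modules are learned (trained with imitation and reinforcement learning) rather than hand-coded as here; the sketch only illustrates the interaction loop that defines the task.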