Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog

Jiaping Zhang, Tiancheng Zhao, Zhou Yu

Creating an intelligent conversational system that understands vision and language is one of the ultimate goals in Artificial Intelligence (AI) (Winograd, 1972). Extensive research has focused on vision-to-language generation; however, limited research has touched on combining these two modalities in a goal-driven dialog context. We propose a multimodal hierarchical reinforcement learning framework that dynamically integrates vision and language for task-oriented visual dialog. The framework jointly learns the multimodal dialog state representation and the hierarchical dialog policy to improve both dialog task success and efficiency. We also propose a new technique, state adaptation, to integrate context awareness in the dialog state representation. We evaluate the proposed framework and the state adaptation technique in an image guessing game and achieve promising results.

Language-Guided Adaptive Perception for Efficient Grounded Communication with Robotic Manipulators in Cluttered Environments

Siddharth Patki and Thomas Howard

SIGdial 2018

19th Annual SIGdial Meeting on Discourse and Dialogue
