|Pei-Hao Su, Paweł Budzianowski, Stefan Ultes, Milica Gasic and Steve Young|
Deep reinforcement learning (RL) methods have significant potential for dialogue policy optimisation. However, they suffer from a poor performance in the early stages of learning. This is especially problematic for on-line learning with real users. Two approaches are introduced to tackle this problem. Firstly, to speed up the learning process, two sampleefficient neural networks algorithms: trust region actor-critic with experience replay (TRACER) and episodic natural actorcritic with experience replay (eNACER) are presented. For TRACER, the trust region helps to control the learning step size and avoid catastrophic model changes. For eNACER, the natural gradient identifies the steepest ascent direction in policy space to speed up the convergence. Both models employ off-policy learning with experience replay to improve sampleefficiency. Secondly, to mitigate the cold start issue, a corpus of demonstration data is utilised to pre-train the models prior to on-line reinforcement learning. Combining these two approaches, we demonstrate a practical approach to learning deep RLbased dialogue policies and demonstrate their effectiveness in a task-oriented information seeking domain.