InterSpeech 2021

Timing Generating Networks: Neural Network based Precise Turn-taking Timing Prediction in Multiparty Conversation
(3 minutes introduction)

Shinya Fujie (Chiba Institute of Technology, Japan), Hayato Katayama (Waseda University, Japan), Jin Sakuma (Waseda University, Japan), Tetsunori Kobayashi (Waseda University, Japan)
A novel neural-network-based framework for precise timing generation, named the Timing Generating Network (TGN), is proposed and applied to turn-taking timing decision problems. Although turn-taking problems have conventionally been formalized as detecting the end of the user’s turn, this approach cannot estimate the precise timing at which a spoken dialogue system should take a turn and start its utterance. Several conventional approaches do estimate precise timings, but because the estimation is executed only at or after the end of the preceding user utterance, they depend heavily on the accuracy of intermediate decision modules such as voice activity detection. The advantages of the TGN are that its parameters are tunable via error backpropagation, since it is described in a differentiable form as a whole, and that it is free from inter-module error propagation, since it has no deterministic intermediate modules. The experimental results show that the proposed system is superior to a conventional turn-taking system that adopts hard decisions on the user’s voice activity detection and response time estimation.