InterSpeech 2021

Real-time End-to-End Monaural Multi-speaker Speech Recognition
(Oral presentation)

Song Li (Xiamen University, China), Beibei Ouyang (Xiamen University, China), Fuchuan Tong (Xiamen University, China), Dexin Liao (Xiamen University, China), Lin Li (Xiamen University, China), Qingyang Hong (Xiamen University, China)
The rising interest in single-channel multi-speaker speech separation has triggered the development of end-to-end multi-speaker automatic speech recognition (ASR). However, most systems to date have adopted autoregressive decoding, which is slow and hinders the deployment of multi-speaker speech recognition in real-world environments. In this paper, we first comprehensively investigate and compare mainstream end-to-end multi-speaker speech recognition systems. Second, we improve the recently proposed non-autoregressive end-to-end speech recognition model Mask-CTC and introduce it to multi-speaker speech recognition to achieve real-time decoding. Our experiments on the LibriMix dataset show that, with the same number of parameters, the non-autoregressive model achieves performance close to that of the autoregressive model while decoding faster.
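
To illustrate why a Mask-CTC-style non-autoregressive decoder avoids the token-by-token latency of autoregressive decoding, the following is a minimal sketch (not the paper's implementation): greedy CTC output is produced in one pass, low-confidence tokens are masked, and the masks are refilled over a few parallel refinement iterations. The names `ctc_probs`, `refine_fn`, `threshold`, and `n_iter` are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

BLANK = 0
MASK = -1  # placeholder id for low-confidence tokens


def ctc_greedy_collapse(ctc_probs):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks.
    Returns token ids and their frame-level confidences."""
    frame_ids = ctc_probs.argmax(axis=1)
    frame_conf = ctc_probs.max(axis=1)
    tokens, confs = [], []
    prev = BLANK
    for tid, conf in zip(frame_ids, frame_conf):
        if tid != BLANK and tid != prev:
            tokens.append(int(tid))
            confs.append(float(conf))
        prev = tid
    return tokens, confs


def mask_ctc_decode(ctc_probs, refine_fn, threshold=0.9, n_iter=3):
    """Mask low-confidence CTC tokens, then refill them over a few
    mask-predict iterations using a (hypothetical) conditional masked-LM
    callable `refine_fn(hyp, masked_positions) -> [(token, confidence), ...]`."""
    tokens, confs = ctc_greedy_collapse(ctc_probs)
    hyp = [t if c >= threshold else MASK for t, c in zip(tokens, confs)]
    for _ in range(n_iter):
        masked = [i for i, t in enumerate(hyp) if t == MASK]
        if not masked:
            break
        preds = refine_fn(hyp, masked)
        # Keep the most confident predictions this round; re-mask the rest.
        keep = max(1, len(masked) // 2)
        ranked = sorted(zip(masked, preds), key=lambda x: -x[1][1])
        for pos, (tok, _) in ranked[:keep]:
            hyp[pos] = tok
    return [t for t in hyp if t != MASK]


if __name__ == "__main__":
    # Toy usage with random posteriors and a dummy refiner that always
    # predicts token 1; a real system would use a trained masked-LM decoder.
    rng = np.random.default_rng(0)
    probs = rng.random((50, 30))
    probs /= probs.sum(axis=1, keepdims=True)
    dummy_refine = lambda hyp, idx: [(1, 1.0) for _ in idx]
    print(mask_ctc_decode(probs, dummy_refine, threshold=0.2))
```

Because every refinement pass fills several positions in parallel, the number of decoder calls is bounded by the small iteration count rather than by the output length, which is the source of the decoding speedup reported in the abstract.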