Real-time End-to-End Monaural Multi-speaker Speech Recognition 
(Oral presentation)
        
       
Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li, Qingyang Hong (all Xiamen University, China)

The rising interest in single-channel multi-speaker speech separation has spurred the development of end-to-end multi-speaker automatic speech recognition (ASR). However, most systems to date decode autoregressively, which makes decoding slow and hinders the use of multi-speaker speech recognition in real-world environments. In this paper, we first comprehensively investigate and compare the mainstream end-to-end multi-speaker speech recognition systems. Second, we improve the recently proposed non-autoregressive end-to-end speech recognition model Mask-CTC and apply it to multi-speaker speech recognition to achieve real-time decoding. Our experiments on the LibriMix dataset show that, with the same number of parameters, the non-autoregressive model achieves performance close to that of the autoregressive model while decoding faster.
	
		
	
	                       
      







