|Yifei Wu (SJTU, China), Chenda Li (SJTU, China), Song Yang (TAL, China), Zhongqin Wu (TAL, China), Yanmin Qian (SJTU, China)|
Speech from microphones is vulnerable in a complex acoustic environment due to noise and reverberation, while the cameras are not. Thus, utilizing the visual modality in the “cocktail party” scenario with multi-talkers has become a promising and popular approach. In this paper, we have explored the incorporating of visual modality into the end-to-end multi-talker speech recognition task. We propose two methods based on the modality fusion position, which are encoder-based fusion and decoder-based fusion. And for each method, advanced audio-visual fusion techniques including attention mechanism and dual decoder have been explored to find the best usage of the visual modality. With the proposed methods, our best audio-visual multi-talker automatic speech recognition (ASR) model gets almost ~50.0% word error rate (WER) reduction compared to the audio-only multi-talker ASR system.