InterSpeech 2021

Audio-Visual Information Fusion Using Cross-modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments
(3-minute introduction)

Hengshun Zhou (USTC, China), Jun Du (USTC, China), Hang Chen (USTC, China), Zijun Jing (iFLYTEK, China), Shifu Xiong (iFLYTEK, China), Chin-Hui Lee (Georgia Tech, USA)
We propose an information fusion approach to audio-visual voice activity detection (AV-VAD) based on cross-modal teacher-student learning, leveraging factorized bilinear pooling (FBP) and Kullback-Leibler (KL) regularization. First, we design an audio-visual network that uses FBP fusion to fully exploit the interaction between the audio and video modalities. Next, to transfer the rich information in an audio-based VAD (A-VAD) model trained on a massive audio-only dataset to an AV-VAD model built with relatively limited multi-modal data, a cross-modal teacher-student learning framework is proposed based on cross entropy with regulated KL-divergence. Finally, when evaluated on an in-house dataset recorded under realistic conditions, the proposed approach yields consistent and significant improvements over other state-of-the-art techniques on standard VAD metrics. Moreover, by applying our AV-VAD technique to an audio-visual Chinese speech recognition task, the character error rate is reduced by 24.15% and 8.66% relative to the A-VAD and baseline AV-VAD systems, respectively.
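To make the two ingredients of the abstract concrete, below is a minimal PyTorch sketch of (1) a generic factorized bilinear pooling layer for fusing audio and visual embeddings and (2) a teacher-student objective that combines cross entropy with a KL term toward A-VAD teacher posteriors. This is an illustration under common formulations of FBP and knowledge distillation, not the authors' exact network or loss; the class/function names, factor count k, temperature T, and weight alpha are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FBPFusion(nn.Module):
    """Factorized bilinear pooling of audio and visual embeddings (sketch)."""

    def __init__(self, audio_dim, visual_dim, factor_dim, out_dim, k=4):
        super().__init__()
        self.k = k
        # Low-rank projections of each modality into k * factor_dim
        self.audio_proj = nn.Linear(audio_dim, factor_dim * k)
        self.visual_proj = nn.Linear(visual_dim, factor_dim * k)
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(factor_dim, out_dim)

    def forward(self, a, v):
        # Bilinear interaction approximated by an element-wise product
        joint = self.audio_proj(a) * self.visual_proj(v)
        joint = self.dropout(joint)
        # Sum-pool over the k factors
        joint = joint.view(*joint.shape[:-1], -1, self.k).sum(-1)
        # Power and L2 normalization, commonly used with FBP
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        joint = F.normalize(joint, dim=-1)
        return self.out(joint)


def teacher_student_loss(student_logits, teacher_logits, labels,
                         alpha=0.5, T=1.0):
    """Cross entropy on ground-truth labels plus a KL regularizer that pulls
    the AV-VAD student posteriors toward the A-VAD teacher posteriors (sketch)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return ce + alpha * kl
```

For example, per-frame audio and visual embeddings could be fused with `FBPFusion(audio_dim=256, visual_dim=256, factor_dim=512, out_dim=2)` to produce speech/non-speech logits, which are then trained with `teacher_student_loss` against logits from a frozen A-VAD teacher on the same frames; how the KL term is weighted or gated is a design choice the paper resolves with its regulated formulation.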