Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection (3 minutes introduction)

Ui-Hyun Kim (Toshiba, Japan)

Recent audio-visual voice activity detectors based on supervised learning require large amounts of labeled training data with manually cropped mouth regions in videos, and their performance is sensitive to mismatch between training and testing noise conditions. This paper introduces contrastive self-supervised learning for audio-visual voice activity detection as a possible solution to these problems. In addition, a novel self-supervised learning framework is proposed to improve overall training efficiency and testing performance on noise-corrupted datasets, as encountered in real-world scenarios. The framework includes a branched audio encoder and a noise-tolerant loss function to cope with the uncertainty of separating speech and noise features in a self-supervised manner. Experimental results, particularly under mismatched noise conditions, demonstrate improved performance over both a self-supervised learning baseline and a supervised learning framework.
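
The abstract names contrastive self-supervised learning but does not give the loss itself. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive loss between paired audio and visual embeddings. It is not the paper's noise-tolerant loss or branched encoder; the function names, the temperature value, and the toy embeddings are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(audio, visual, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired audio/visual embeddings.

    audio[i] and visual[i] are assumed to come from the same clip
    (positive pair); every visual[j] with j != i serves as a negative.
    """
    losses = []
    for i, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in visual]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical values, not taken from the paper).
audio = [[1.0, 0.0], [0.0, 1.0]]
visual_matched = [[0.9, 0.1], [0.1, 0.9]]
visual_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = info_nce(audio, visual_matched)
loss_swapped = info_nce(audio, visual_swapped)
```

In this toy setup the loss is small when each audio embedding lines up with its own visual embedding and large when the pairs are swapped, which is the signal a contrastive objective of this kind trains on without manual speech labels.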

Related Recordings

Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
(3 minutes introduction)

Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren

Multi-Channel VAD for Transcription of Group Discussion
(3 minutes introduction)

Osamu Ichikawa, Kaito Nakano, Takahiro Nakayama, Hajime Shirouzu

InterSpeech 2021
