InterSpeech 2021

Speech Enhancement with Weakly Labelled Data from AudioSet
(3-minute introduction)

Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang (ByteDance, China)
Speech enhancement is the task of improving the intelligibility and perceptual quality of degraded speech signals. Recently, neural network-based methods have been applied to speech enhancement. However, many of these methods require users to collect clean speech and background noise for training, which can be time-consuming. In addition, speech enhancement systems trained on particular types of background noise may not generalize well to a wide range of noise. To tackle these problems, we propose a speech enhancement framework trained on weakly labelled data. We first apply a pretrained sound event detection system to detect anchor segments that contain sound events in audio clips. Then, we randomly mix two detected anchor segments to create a mixture. We build a conditional source separation network that takes the mixture and a conditional vector as input, where the conditional vector is obtained from the audio tagging predictions on the anchor segments. At inference time, we input a noisy speech signal to the trained system, with the one-hot encoding of “Speech” as the condition, to predict the enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system, which achieved 2.16 and 7.73 dB respectively.
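
To make the pipeline concrete, below is a minimal PyTorch sketch of one training step and of inference. The tiny FiLM-conditioned convolutional model, the random tensors standing in for SED-detected anchor segments and for audio tagging probabilities, and the "Speech" class index are all illustrative assumptions, not the paper's actual architecture.

# A minimal, self-contained sketch of the training setup described above.
# Model sizes, the FiLM-style conditioning, and the stand-in inputs are
# illustrative assumptions; the paper's actual system may differ.
import torch
import torch.nn as nn

NUM_CLASSES = 527  # number of classes in the AudioSet ontology

class ConditionalSeparator(nn.Module):
    """Toy conditional source separation network: estimates the source
    selected by the conditional (audio tagging) vector."""
    def __init__(self, num_classes=NUM_CLASSES, channels=16):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel_size=9, padding=4)
        # Conditioning: project the tagging vector to per-channel scale/shift
        self.film = nn.Linear(num_classes, 2 * channels)
        self.decoder = nn.Conv1d(channels, 1, kernel_size=9, padding=4)

    def forward(self, mixture, condition):
        # mixture: (batch, samples), condition: (batch, num_classes)
        h = torch.relu(self.encoder(mixture.unsqueeze(1)))
        scale, shift = self.film(condition).chunk(2, dim=1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        return self.decoder(h).squeeze(1)  # estimated target waveform

# --- one training step on weakly labelled data (shapes only) ---
batch, samples = 4, 32000  # e.g. 2 s at 16 kHz
# Anchor segments s1, s2 would come from a pretrained SED model applied to
# AudioSet clips; random tensors stand in for them here.
s1, s2 = torch.randn(batch, samples), torch.randn(batch, samples)
mixture = s1 + s2
# The condition would be the audio tagging prediction on s1 (probabilities
# over the 527 AudioSet classes); a random softmax stands in for it here.
cond = torch.softmax(torch.randn(batch, NUM_CLASSES), dim=-1)

model = ConditionalSeparator()
estimate = model(mixture, cond)
loss = torch.mean(torch.abs(estimate - s1))  # regress the conditioned source
loss.backward()

# --- inference: a one-hot "Speech" condition selects enhanced speech ---
SPEECH_INDEX = 0  # "Speech" is the first class in the AudioSet label list
speech_cond = torch.zeros(1, NUM_CLASSES)
speech_cond[0, SPEECH_INDEX] = 1.0
noisy = torch.randn(1, samples)  # stand-in for a noisy speech recording
with torch.no_grad():
    enhanced = model(noisy, speech_cond)

The point the sketch illustrates is that a single conditional network serves both roles: it is trained as a source separator on mixtures of anchor segments, and it becomes a speech enhancer at inference simply by switching the condition to the one-hot “Speech” vector.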