InterSpeech 2021

Self-Supervised Learning Based Phone-Fortified Speech Enhancement
(3-minute introduction)

Yuanhang Qiu (Massey University, New Zealand), Ruili Wang (Massey University, New Zealand), Satwinder Singh (Massey University, New Zealand), Zhizhong Ma (Massey University, New Zealand), Feng Hou (Massey University, New Zealand)
For speech enhancement, deep complex network based methods have shown promising performance due to their effectiveness in dealing with complex-valued spectra. Recent speech enhancement methods focus on further optimizing network structures and hyperparameters, but ignore inherent speech characteristics (e.g., phonetic characteristics) that are important for networks to learn and reconstruct speech information. In this paper, we propose a novel self-supervised learning based phone-fortified (SSPF) method for speech enhancement. Our method explicitly imports phonetic characteristics into a deep complex convolutional network via a Contrastive Predictive Coding (CPC) model pre-trained with self-supervised learning. This operation greatly improves speech representation learning and speech enhancement performance. Moreover, we apply the self-attention mechanism to our model to learn long-range dependencies of a speech sequence, which further improves the performance of speech enhancement. The experimental results demonstrate that our SSPF method outperforms existing methods and achieves state-of-the-art performance in terms of speech quality and intelligibility.
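The CPC pre-training mentioned above rests on a contrastive (InfoNCE) objective: a context vector at time t must assign a higher score to the true future latent than to "negative" latents sampled from other times or utterances. The following is a minimal illustrative sketch, not the authors' implementation (whose details are not given in this abstract); for simplicity it scores with a plain dot product, whereas the original CPC formulation uses a learned bilinear transform.

```python
import math


def infonce_loss(c_t, z_pos, z_negs):
    """CPC-style InfoNCE loss (illustrative sketch).

    c_t    : context vector at time t
    z_pos  : true future latent z_{t+k} (the positive sample)
    z_negs : latents drawn from other times/utterances (negatives)
    """
    # Dot-product scorer; CPC proper uses a learned bilinear map z^T W_k c.
    def score(z, c):
        return sum(zi * ci for zi, ci in zip(z, c))

    # Positive sample sits at index 0 of the logit list.
    logits = [score(z_pos, c_t)] + [score(z, c_t) for z in z_negs]

    # Cross-entropy against index 0, computed in a numerically stable way.
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -math.log(math.exp(logits[0] - m) / denom)


# Toy check: a context aligned with its true future latent incurs a
# lower loss than one paired with a mismatched "positive".
good = infonce_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0], [0.0, 1.0]])
bad = infonce_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Minimizing this loss pushes the encoder toward representations that are predictive of future speech content (and hence correlate with phonetic identity), which is what allows the pre-trained CPC features to be imported into the enhancement network.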