InterSpeech 2021

Speech Emotion Recognition with Multi-task Learning
(3-minute introduction)

Xingyu Cai (Baidu, USA), Jiahong Yuan (Baidu, USA), Renjie Zheng (Baidu, USA), Liang Huang (Baidu, USA), Kenneth Church (Baidu, USA)
Speech emotion recognition (SER) classifies speech into emotion categories such as Happy, Angry, Sad, and Neutral. Recently, deep learning has been applied to the SER task. This paper proposes a multi-task learning (MTL) framework that simultaneously performs speech-to-text recognition and emotion classification, using an end-to-end deep neural model based on wav2vec-2.0. Experiments on the IEMOCAP benchmark show that the proposed method achieves state-of-the-art performance on the SER task. In addition, an ablation study establishes the effectiveness of the proposed MTL framework.
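The core idea of such an MTL setup is a single shared encoder (here, wav2vec-2.0) feeding two task heads, whose losses are combined into one weighted objective. The following is a minimal NumPy sketch of that objective, not the authors' implementation: the random features stand in for wav2vec-2.0 frame outputs, the head weights and the mixing weight `alpha` are hypothetical, and plain frame-level cross-entropy is used in place of the CTC loss an ASR branch would normally use, purely to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, targets):
    # Mean negative log-likelihood of the target classes.
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))

# Stand-in for wav2vec-2.0 frame features: (T frames, D dims).
T, D, vocab_size, n_emotions = 50, 16, 30, 4
feats = rng.standard_normal((T, D))

# Two hypothetical linear heads on the shared encoder output.
W_asr = rng.standard_normal((D, vocab_size)) * 0.1   # speech-to-text head
W_emo = rng.standard_normal((D, n_emotions)) * 0.1   # emotion head

# ASR branch: per-frame token logits. (A real system would apply a
# CTC loss here; frame-level cross-entropy is a stand-in.)
asr_logits = feats @ W_asr
asr_targets = rng.integers(0, vocab_size, size=T)
loss_asr = cross_entropy(asr_logits, asr_targets)

# Emotion branch: mean-pool frames into one utterance vector, classify.
emo_logits = (feats.mean(axis=0) @ W_emo)[None, :]
emo_target = np.array([2])  # e.g. "Sad"
loss_emo = cross_entropy(emo_logits, emo_target)

# Multi-task objective: weighted sum of the two task losses.
alpha = 0.1  # hypothetical task-mixing weight
loss = alpha * loss_asr + (1 - alpha) * loss_emo
print(float(loss))
```

In training, gradients from both losses flow back into the shared encoder, which is what lets the transcription task regularize and inform the emotion classifier.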