Interspeech 2021

Separation of Emotional and Reconstruction Embeddings on Ladder Network to Improve Speech Emotion Recognition Robustness in Noisy Conditions
(Oral presentation)

Seong-Gyun Leem (University of Texas at Dallas, USA), Daniel Fulford (Boston University, USA), Jukka-Pekka Onnela (Harvard University, USA), David Gard (San Francisco State University, USA), Carlos Busso (University of Texas at Dallas, USA)
When speech emotion recognition (SER) is deployed in a real application, the system should be able to cope with audio acquired in noisy, unconstrained environments. Most studies on noise-robust SER either require a parallel dataset with emotion labels, which is impractical to collect, or use speech with artificially added noise, which does not resemble practical conditions. This study builds upon the ladder network formulation, which can effectively compensate for the environmental differences between a clean speech corpus and real-life recordings. We propose a decoupled ladder network, which increases the robustness of the SER system against non-stationary background noise by decoupling the last hidden layer into separate emotion and reconstruction embeddings. This novel implementation allows the emotion embedding to focus exclusively on building a discriminative representation, without being constrained by the reconstruction task. We introduce a noisy version of the MSP-Podcast database, which contains audio segments collected with a smartphone that simultaneously records sentences from the corpus and non-stationary noise at different signal-to-noise ratios (SNRs). We evaluate the proposed model on this corpus, showing that the decoupled ladder network outperforms the regular ladder network when dealing with noisy recordings.
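
To make the decoupling idea concrete, below is a minimal PyTorch sketch, not the paper's actual implementation: the encoder's last hidden layer is split into an emotion embedding that feeds only the classifier and a reconstruction embedding that feeds only the decoder, so the supervised and reconstruction objectives no longer compete over the same representation. All layer sizes, module names, and loss choices here are illustrative assumptions.

import torch
import torch.nn as nn

class DecoupledLadderSketch(nn.Module):
    """Hypothetical sketch of the decoupling idea: the last hidden layer
    is split into an emotion embedding (classification path only) and a
    reconstruction embedding (reconstruction path only). Dimensions and
    structure are illustrative assumptions, not the paper's architecture."""
    def __init__(self, feat_dim=128, hidden_dim=256,
                 emo_dim=128, rec_dim=128, num_emotions=4):
        super().__init__()
        self.emo_dim = emo_dim
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, emo_dim + rec_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(emo_dim, num_emotions)
        self.decoder = nn.Sequential(
            nn.Linear(rec_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, noisy_feats):
        h = self.encoder(noisy_feats)
        # Split the last hidden layer into the two decoupled embeddings.
        emo_emb = h[:, :self.emo_dim]      # supervised emotion path
        rec_emb = h[:, self.emo_dim:]      # reconstruction path
        return self.classifier(emo_emb), self.decoder(rec_emb)

# Training-step sketch with dummy data: an emotion loss on the labeled
# examples plus a reconstruction loss that encourages recovering clean
# acoustic features from noisy input.
model = DecoupledLadderSketch()
noisy = torch.randn(8, 128)             # noisy acoustic features
clean = torch.randn(8, 128)             # corresponding clean features
labels = torch.randint(0, 4, (8,))      # emotion class labels
logits, recon = model(noisy)
loss = (nn.functional.cross_entropy(logits, labels)
        + nn.functional.mse_loss(recon, clean))
loss.backward()

Because only the reconstruction embedding carries the denoising objective, the gradient pressure to preserve low-level acoustic detail no longer flows through the emotion embedding, which is the intuition behind the robustness gain reported in the abstract.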