InterSpeech 2021

Stochastic Process Regression for Cross-Cultural Speech Emotion Recognition

Mani Kumar T. (University of Nottingham, UK), Enrique Sanchez (University of Nottingham, UK), Georgios Tzimiropoulos (Queen Mary University of London, UK), Timo Giesbrecht (Unilever, UK), Michel Valstar (University of Nottingham, UK)
In this work, we pose continuous apparent emotion recognition from speech as a problem of learning distributions of functions, and do so using Stochastic Process Regression. We presume that the relation between speech signals and their corresponding emotion labels is governed by an underlying stochastic process, in contrast to existing speech emotion recognition methods, which are mostly based on deterministic regression models (static or recurrent). We treat each training sequence as an instance of this underlying stochastic process, which we aim to discover using a neural latent variable model. The model approximates the distribution over functions with a stochastic latent variable via an encoder-decoder composition: the encoder infers the distribution over the latent variable, which the decoder then uses to predict the distribution of output emotion labels. To this end, we build on the previously proposed Neural Processes framework by using (a) the noisy label predictions of a backbone network, instead of ground-truth labels, for latent variable inference, and (b) recurrent encoder-decoder models to alleviate the commonly encountered temporal misalignment between audio features and emotion labels caused by annotator reaction lag. We validate our method on the AVEC'19 cross-cultural emotion recognition dataset, achieving state-of-the-art results.
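The encoder-decoder composition described in the abstract can be sketched minimally as follows. This is an illustrative NumPy toy of the generic Neural Process structure, not the paper's implementation: all dimensions, weight initialisations, and the mean-pooling aggregator are assumptions, and the recurrent components and backbone network of the actual method are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Random weights standing in for trained parameters (illustrative only).
    return rng.normal(0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

# Assumed toy dimensions: feature, label, representation, latent.
D_X, D_Y, D_R, D_Z = 8, 1, 16, 4

W_enc, b_enc = linear(D_X + D_Y, D_R)
W_mu,  b_mu  = linear(D_R, D_Z)
W_lv,  b_lv  = linear(D_R, D_Z)
W_dec, b_dec = linear(D_X + D_Z, 2 * D_Y)  # decoder emits mean and log-variance

def encode(x_ctx, y_ctx):
    # Encoder: per-pair representation, mean-aggregated over the context set,
    # then mapped to the parameters of q(z) -- the distribution over the latent.
    h = np.tanh(np.concatenate([x_ctx, y_ctx], axis=-1) @ W_enc + b_enc)
    r = h.mean(axis=0)
    return r @ W_mu + b_mu, r @ W_lv + b_lv  # mu, log-variance of z

def decode(x_tgt, z):
    # Decoder: conditioned on a latent sample z, predicts a Gaussian over the
    # emotion label at each target input.
    h = np.concatenate([x_tgt, np.tile(z, (len(x_tgt), 1))], axis=-1)
    out = h @ W_dec + b_dec
    return out[:, :D_Y], out[:, D_Y:]  # predicted mean, log-variance

# Toy context: 10 (feature, label) pairs. In the paper the context labels come
# from a backbone's noisy predictions rather than from ground truth.
x_ctx = rng.normal(size=(10, D_X))
y_ctx = rng.normal(size=(10, D_Y))

mu, logvar = encode(x_ctx, y_ctx)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=D_Z)  # reparameterised sample
y_mean, y_logvar = decode(rng.normal(size=(5, D_X)), z)
print(y_mean.shape, y_logvar.shape)
```

Predicting a full label distribution (mean and variance) rather than a point estimate is what distinguishes this family of models from the deterministic regressors the abstract contrasts against.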