Towards the explainability of Multimodal Speech Emotion Recognition (3 minutes introduction)

Puneet Kumar (IIT Roorkee, India), Vishesh Kaushik (IIT Kanpur, India), Balasubramanian Raman (IIT Roorkee, India)

This paper presents a multimodal speech emotion recognition system and proposes a novel technique to explain its predictions. Audio features are extracted with an attention-based Gated Recurrent Unit (GRU), and textual features with pre-trained Bidirectional Encoder Representations from Transformers (BERT). The two feature sets are then concatenated and used to predict the final emotion class. The system achieves weighted and unweighted emotion recognition accuracies of 71.7% and 75.0%, respectively, on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which contains speech utterances and their corresponding text transcripts. The training and predictions of the network layers are analyzed qualitatively through emotion embedding plots and quantitatively through the intersection matrices of the embeddings of the various emotion classes.
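
The fusion step described in the abstract (attention over recurrent audio states, concatenation with a text embedding, then classification) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the vector sizes, all numeric values, the function names, and the four emotion labels are hypothetical, and the GRU and BERT encoders are replaced by hand-written toy vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, attn_vector):
    """Collapse per-frame hidden states into one vector using attention weights."""
    # One scalar score per frame: dot product with a (hypothetical) learned vector.
    scores = [sum(h * a for h, a in zip(state, attn_vector)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of the frames, dimension by dimension.
    return [sum(w * state[i] for w, state in zip(weights, hidden_states)) for i in range(dim)]

def fuse_and_classify(audio_vec, text_vec, class_weights, class_biases):
    """Concatenate the two modality vectors, then apply a linear layer and softmax."""
    fused = audio_vec + text_vec  # list concatenation, audio first
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(class_weights, class_biases)]
    return fused, softmax(logits)

# Toy inputs; every value below is made up for illustration.
audio_states = [[0.1, 0.4], [0.7, -0.2], [0.3, 0.9]]  # stand-in for GRU outputs (3 frames)
attn_vector = [0.5, -0.5]                             # stand-in for learned attention parameters
text_vec = [0.2, -0.1, 0.6]                           # stand-in for a BERT sentence embedding
class_weights = [                                     # 4 classes x 5 fused features
    [0.3, -0.2, 0.5, 0.1, -0.4],
    [-0.1, 0.4, 0.2, -0.3, 0.6],
    [0.2, 0.2, -0.5, 0.4, 0.1],
    [-0.4, 0.1, 0.3, 0.2, -0.2],
]
class_biases = [0.0, 0.1, -0.1, 0.0]
labels = ["angry", "happy", "neutral", "sad"]         # hypothetical label set

audio_vec = attention_pool(audio_states, attn_vector)
fused, probs = fuse_and_classify(audio_vec, text_vec, class_weights, class_biases)
predicted = labels[probs.index(max(probs))]
```

Here `fused` has one entry per audio dimension plus one per text dimension, and `probs` is a distribution over the label set. In a trained system the attention vector and classifier weights would be learned jointly rather than fixed by hand.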

Related Recordings

Reliable estimates of interpretable cue effects with Active Learning in psycholinguistic research
(3 minutes introduction)

Marieke Einfeldt, Rita Sevastjanova, Katharina Zahner-Ritter, Ekaterina Kazak, Bettina Braun

Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance
(3 minutes introduction)

Takanori Ashihara, Takafumi Moriya, Makio Kashino

InterSpeech 2021
