InterSpeech 2021

Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition

Pooja Kumawat (IIT Kharagpur, India), Aurobinda Routray (IIT Kharagpur, India)
We analyze Time Delay Neural Network (TDNN) based architectures for speech emotion classification. TDNN models efficiently capture temporal information and provide an utterance-level prediction. Emotions are dynamic in nature and require temporal context for reliable prediction. In this work, we apply the TDNN-based x-vector and the Emphasized Channel Attention, Propagation and Aggregation TDNN (ECAPA-TDNN) architectures to speech emotion identification on the RAVDESS, Emo-DB, and IEMOCAP databases. The results show that the TDNN architectures perform well at predicting emotion classes and that ECAPA-TDNN outperforms the TDNN-based x-vector architecture. Next, we investigate the performance of ECAPA-TDNN with various training chunk durations and test utterance durations. We find that, despite very promising emotion recognition performance, the TDNN models have a strong bias toward the training chunk duration. Earlier research revealed that the accuracy of individual emotion classes depends largely on the test utterance duration. Most of those studies were based on frame-level emotion prediction, whereas utterance-level emotion recognition remains relatively unexplored. Our results show that, even with TDNN models, the accuracy of the different emotion classes depends on the utterance duration.
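The duration analysis described above can be illustrated with a minimal sketch. The abstract does not say how training chunks are drawn or how test utterances are grouped, so the code assumes random cropping with zero padding and fixed duration bins. The names `crop_chunk`, `accuracy_by_duration`, and `bin_edges` are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def crop_chunk(signal, sample_rate, chunk_seconds, rng=random):
    """Return one training chunk of exactly chunk_seconds.

    Assumption: a chunk is a random contiguous crop, and utterances
    shorter than the chunk are zero-padded at the end.
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    if len(signal) <= chunk_len:
        return list(signal) + [0.0] * (chunk_len - len(signal))
    start = rng.randrange(len(signal) - chunk_len + 1)
    return list(signal[start:start + chunk_len])


def accuracy_by_duration(records, bin_edges):
    """Compute per-class accuracy within test-duration bins.

    records: iterable of (duration_seconds, true_label, predicted_label).
    bin_edges: ascending list of bin boundaries in seconds (assumed values).
    Returns a dict mapping (bin_index, true_label) to accuracy.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for duration, label, predicted in records:
        # The bin index is the number of edges the duration has reached.
        bin_index = sum(duration >= edge for edge in bin_edges)
        key = (bin_index, label)
        totals[key] += 1
        hits[key] += int(predicted == label)
    return {key: hits[key] / totals[key] for key in totals}
```

Repeating the evaluation for models trained with different `chunk_seconds` values and comparing the resulting tables is one way to expose a training chunk duration bias of the kind reported here.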