InterSpeech 2021

Speech Emotion Recognition based on Attention Weight Correction Using Word-level Confidence Measure

Jennifer Santoso (University of Tsukuba, Japan), Takeshi Yamada (University of Tsukuba, Japan), Shoji Makino (University of Tsukuba, Japan), Kenkichi Ishizuka (Revcomm, Japan), Takekatsu Hiramura (Revcomm, Japan)
Emotion recognition is essential for human behavior analysis and can draw on various inputs, such as speech and images. In practical settings such as call center analysis, however, the only available information is speech, which motivates the study of speech emotion recognition (SER). Given the complexity of emotions, SER is a challenging task. Recently, automatic speech recognition (ASR) has been used to obtain text information from speech, and combining speech with ASR results has improved SER performance. However, ASR results are strongly affected by speech recognition errors. Although there is a method to improve ASR performance on emotional speech, it requires fine-tuning the ASR system, which is costly. To mitigate the effect of ASR errors on SER, we propose combining a self-attention mechanism with a word-level confidence measure (CM), which indicates the reliability of ASR results, to reduce the importance of words that are likely to be misrecognized. Experimental results confirmed that combining the self-attention mechanism with the CM reduced the effects of incorrectly recognized words in ASR results, providing a better focus on the words that determine emotion recognition. Our proposed method outperformed the state-of-the-art methods on the IEMOCAP dataset.
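
The idea of down-weighting unreliable words can be illustrated with a minimal sketch. The abstract does not give the correction formula, so the version below is an assumption: each self-attention weight is multiplied by the word's confidence score and the result is renormalized to sum to one. The function names, the example words, and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def correct_attention(weights, confidences):
    """Scale each attention weight by its word-level confidence, then renormalize.

    This multiplicative rule is an assumed form of the correction,
    not necessarily the one used by the authors.
    """
    if len(weights) != len(confidences):
        raise ValueError("weights and confidences must have the same length")
    scaled = [w * c for w, c in zip(weights, confidences)]
    total = sum(scaled)
    if total == 0.0:
        # Every word has zero confidence: fall back to the uncorrected weights.
        return list(weights)
    return [s / total for s in scaled]


# Hypothetical ASR hypothesis in which "dish" is a likely misrecognition.
words = ["i", "hate", "dish", "so", "much"]
scores = [0.1, 2.0, 1.5, 0.2, 0.8]            # made-up attention scores
confidences = [0.95, 0.90, 0.20, 0.97, 0.93]  # made-up word-level CMs in [0, 1]

weights = softmax(scores)
corrected = correct_attention(weights, confidences)

for word, before, after in zip(words, weights, corrected):
    print(f"{word:>5}  before={before:.3f}  after={after:.3f}")
```

In this toy example the low-confidence word loses most of its attention mass, and the remaining mass is redistributed to the words the recognizer is more certain about.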