InterSpeech 2021

Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification
(3-minute introduction)

Takahiro Fukumori (Ritsumeikan University, Japan)
Discrimination between shouted and normal speech is crucial in audio surveillance and monitoring. Although recent methods employ deep neural networks, they still rely on traditional low-level speech features such as mel-frequency cepstral coefficients and the mel spectrum. This paper presents a deep spectral-cepstral fusion approach that learns descriptive features for the target classification from high-dimensional spectrograms and cepstrograms. We compare three types of architectures as base networks: convolutional neural networks (CNNs), gated recurrent unit (GRU) networks, and their combination (CNN-GRU). Using a corpus comprising real shouts and normal speech, we present a comprehensive comparison with conventional methods to verify the effectiveness of the proposed feature-learning approach. Experiments conducted in various noisy environments demonstrate that the CNN-GRU based on our spectral-cepstral features achieves better classification performance than networks based on a single feature. This finding suggests the effectiveness of using high-dimensional sources for speech-type recognition in sound event detection.
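
The abstract describes the CNN-GRU fusion architecture only at a high level. Below is a minimal PyTorch sketch of that idea, assuming one CNN-GRU branch per input (spectrogram and cepstrogram) whose outputs are concatenated before a two-class output layer. All layer sizes, kernel shapes, and the late-fusion point are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn


class BranchCNNGRU(nn.Module):
    """One branch: 2-D convolutions over a time-frequency (or
    time-quefrency) map, then a GRU over the time axis."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool the frequency axis only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        # After two (2, 1) poolings, 128 input bins become 32,
        # so each frame yields a 32 * 32 = 1024-dim feature vector.
        self.gru = nn.GRU(input_size=32 * 32, hidden_size=hidden,
                          batch_first=True)

    def forward(self, x):                   # x: (batch, 1, 128, time)
        f = self.cnn(x)                     # (batch, 32, 32, time)
        b, c, q, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * q)  # (batch, time, feat)
        _, h = self.gru(f)                  # h: (1, batch, hidden)
        return h.squeeze(0)                 # (batch, hidden)


class SpectralCepstralFusion(nn.Module):
    """Late fusion of a spectrogram branch and a cepstrogram branch,
    followed by a two-class (shouted / normal) output layer."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.spec_branch = BranchCNNGRU(hidden)
        self.ceps_branch = BranchCNNGRU(hidden)
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, spec, ceps):
        z = torch.cat([self.spec_branch(spec),
                       self.ceps_branch(ceps)], dim=1)
        return self.classifier(z)           # logits for {normal, shouted}


# Usage with dummy inputs: batch of 4, 128 bins, 100 frames each.
model = SpectralCepstralFusion()
spec = torch.randn(4, 1, 128, 100)
ceps = torch.randn(4, 1, 128, 100)
print(model(spec, ceps).shape)              # torch.Size([4, 2])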