Takahiro Fukumori (Ritsumeikan University, Japan)
Discrimination between shouted and normal speech is crucial in audio surveillance and monitoring. Although recent methods employ deep neural networks, they still rely on traditional low-level speech features such as mel-frequency cepstral coefficients and the mel spectrum. This paper presents a deep spectral-cepstral fusion approach that learns discriminative features for target classification directly from high-dimensional spectrograms and cepstrograms. We compare three types of architectures as base networks: convolutional neural networks (CNNs), gated recurrent unit (GRU) networks, and their combination (CNN-GRU). Using a corpus comprising real shouts and normal speech, we present a comprehensive comparison with conventional methods to verify the effectiveness of the proposed feature learning method. The results of experiments conducted in various noisy environments demonstrate that the CNN-GRU based on our spectral-cepstral features achieves better classification performance than networks based on a single feature type. This finding suggests the effectiveness of using high-dimensional sources for speech-type recognition in sound event detection.
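To illustrate the kind of high-dimensional input the abstract describes, the sketch below computes a log-magnitude spectrogram and the corresponding cepstrogram from a waveform and concatenates them per frame. This is a minimal illustration, not the paper's implementation: the frame length, hop size, window, and the choice to fuse by frame-wise concatenation are all assumptions for demonstration.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Slice a 1-D signal into overlapping, Hann-windowed frames
    (frame_len and hop are illustrative defaults, not the paper's)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hanning(frame_len)

def spectral_cepstral_features(x, frame_len=512, hop=256):
    """Return per-frame fused features: [log spectrum | real cepstrum]."""
    frames = frame_signal(x, frame_len, hop)
    # Spectrogram: magnitude of the one-sided FFT of each frame.
    spec = np.abs(np.fft.rfft(frames, axis=1))
    log_spec = np.log(spec + 1e-10)
    # Cepstrogram: inverse FFT of the log spectrum; keep the low quefrencies.
    ceps = np.fft.irfft(log_spec, n=frame_len, axis=1)[:, : frame_len // 2 + 1]
    # Fuse by concatenating the two representations frame-by-frame
    # (an assumed fusion scheme; the paper fuses inside the network).
    return np.concatenate([log_spec, ceps], axis=1)

# Example: a 1-second 16 kHz test tone yields one fused vector per frame.
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = spectral_cepstral_features(x)
print(feats.shape)  # (n_frames, 2 * (frame_len // 2 + 1))
```

A network such as the CNN-GRU mentioned in the abstract would then consume this (time, feature) matrix, the CNN layers extracting local spectro-temporal patterns and the GRU layers modeling their evolution over time.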