LOCALIZATION OF NON-LINGUISTIC EVENTS IN SPONTANEOUS SPEECH BY NON-NEGATIVE MATRIX FACTORIZATION AND LONG SHORT-TERM MEMORY
Audio/Visual Detection of Non-Linguistic Vocal Outbursts
Presented by: Felix Weninger
Authors: Felix Weninger, Björn Schuller, Martin Wöllmer, Gerhard Rigoll (Technische Universität München, Germany)
Features generated by Non-Negative Matrix Factorization (NMF) have been applied successfully in robust speech processing, including noise-robust speech recognition and the detection of non-linguistic vocalizations. In this study, we introduce a novel tandem approach that integrates likelihood features derived from NMF into Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs) in order to dynamically localize non-linguistic events, i.e., laughter, vocal noise, and non-vocal noise, in highly spontaneous speech. Compared with a baseline consisting of a conventional phoneme-HMM-based speech recognizer, our tandem architecture achieves a 37.5% relative reduction in frame error rate when discriminating between speech and the different types of non-speech segments.
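As an illustration of what per-frame likelihood features derived from NMF could look like, the following is a minimal sketch and not the authors' implementation. It assumes a fixed, pre-trained dictionary `W` whose columns are each assigned to one event class, estimates the activations with the standard multiplicative update for the Euclidean cost, and normalizes the summed activations per class so that each frame sums to one. The function names, the toy dictionary, the class assignment, and the choice of cost function are all assumptions made for this example. The BLSTM-RNN stage that would consume these features is omitted.

```python
import numpy as np


def nmf_activations(V, W, n_iter=200, eps=1e-9, seed=0):
    """Estimate non-negative H with V ~ W @ H while keeping W fixed.

    Assumption: multiplicative update for the Euclidean cost.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    WtV = W.T @ V
    WtW = W.T @ W
    for _ in range(n_iter):
        H *= WtV / (WtW @ H + eps)
    return H


def class_likelihood_features(H, labels, n_classes, eps=1e-9):
    """Sum activations per class, then normalize each frame to sum to one."""
    feats = np.zeros((n_classes, H.shape[1]))
    for k in range(n_classes):
        feats[k] = H[labels == k].sum(axis=0)
    return feats / (feats.sum(axis=0, keepdims=True) + eps)


# Hypothetical toy dictionary: 6 frequency bins, 4 spectral bases.
W = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
# Hypothetical class assignment: bases 0-1 are "speech" (0), bases 2-3 are "laughter" (1).
labels = np.array([0, 0, 1, 1])

# Three synthetic frames built from known activations.
H_true = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])
V = W @ H_true

H = nmf_activations(V, W)
feats = class_likelihood_features(H, labels, n_classes=2)
print(np.round(feats, 3))
```

Each column of `feats` is one frame, and each row is the normalized evidence for one class. In a tandem setup, such a matrix would be passed frame by frame to the sequence model, possibly alongside conventional acoustic features.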