InterSpeech 2021

Speech Activity Detection Based on Multilingual Speech Recognition System
(Oral presentation)

Seyyed Saeed Sarfjoo (Idiap Research Institute, Switzerland), Srikanth Madikeri (Idiap Research Institute, Switzerland), Petr Motlicek (Idiap Research Institute, Switzerland)
To better model contextual information and increase the generalization ability of Speech Activity Detection (SAD), this paper leverages a multilingual Automatic Speech Recognition (ASR) system to perform SAD. Sequence-discriminative training of the Acoustic Model (AM) with the Lattice-Free Maximum Mutual Information (LF-MMI) loss function effectively extracts the contextual information of each input acoustic frame. Multilingual AM training improves robustness to noise and language variability. The index of the maximum output posterior is used as a frame-level speech/non-speech decision function. Majority voting and logistic regression are applied to fuse the language-dependent decisions. The multilingual ASR system is trained on 18 languages from the BABEL datasets, and the resulting SAD is evaluated on 3 different languages. On out-of-domain datasets, the proposed SAD model performs significantly better than the baseline models. On the Ester2 dataset, without using any in-domain data, this model outperforms the WebRTC, phoneme-recognizer-based VAD (Phn_Rec), and Pyannote baselines by 7.1, 1.7, and 2.7% absolute, respectively, in the Detection Error Rate (DetER) metric. Similarly, on the LiveATC dataset, this model outperforms the WebRTC, Phn_Rec, and Pyannote baselines by 6.4, 10.0, and 3.7% absolute, respectively, in DetER.
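
The abstract's frame-level decision (take the index of the maximum AM output posterior, map it to speech/non-speech) and the majority-voting fusion of language-dependent decisions can be sketched as below. This is a minimal illustration, not the authors' implementation: the posterior arrays, the set of non-speech output indices, the function names, and all shapes are assumptions for the example.

```python
import numpy as np

def frame_decisions(posteriors: np.ndarray, nonspeech_ids: set) -> np.ndarray:
    """Per-frame speech (1) / non-speech (0) decision for one language.

    posteriors: (num_frames, num_outputs) AM output posteriors.
    The index of the maximum posterior per frame is checked against a
    (hypothetical) set of non-speech outputs, e.g. silence/noise classes.
    """
    top = posteriors.argmax(axis=1)  # index of the maximum output posterior
    return np.array([0 if i in nonspeech_ids else 1 for i in top])

def majority_vote(decisions: list) -> np.ndarray:
    """Fuse language-dependent frame decisions by majority voting."""
    stacked = np.stack(decisions)            # (num_languages, num_frames)
    return (stacked.mean(axis=0) >= 0.5).astype(int)

# Toy usage: 3 language-dependent AMs, 5 frames, 4 outputs each,
# with output index 0 assumed to be the non-speech class.
rng = np.random.default_rng(0)
per_lang = [frame_decisions(rng.random((5, 4)), nonspeech_ids={0})
            for _ in range(3)]
print(majority_vote(per_lang))  # fused speech/non-speech label per frame
```

The paper also mentions logistic regression as an alternative fusion; in that case the language-dependent decisions (or posteriors) would serve as input features to a trained classifier rather than being averaged as above.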