InterSpeech 2021

Voice Activity Detection With Teacher-Student Domain Emulation
(Oral presentation)

Jarrod Luckenbaugh (University of Texas at Dallas, USA), Samuel Abplanalp (Boston University, USA), Rachel Gonzalez (San Francisco State University, USA), Daniel Fulford (Boston University, USA), David Gard (San Francisco State University, USA), Carlos Busso (University of Texas at Dallas, USA)
Transfer learning is a promising approach to improve performance in many speech-based systems, including voice activity detection (VAD). Domain adaptation, a subfield of transfer learning, often improves model robustness in the presence of a mismatch between training and testing conditions. This study proposes a formulation for VAD based on teacher-student training, where a teacher model trained with clean data transfers knowledge to a student model trained with a noisy, paired version of the corpus that resembles the test conditions. The models leverage temporal information using recurrent neural networks (RNNs), implemented with either bidirectional long short-term memory (BLSTM) cells or the modern, continuous-state Hopfield network. We provide evidence that in-domain noise emulation for domain adaptation is viable under unconstrained audio channel conditions for VAD “in the wild.” Our application domain is healthcare, where multimodal sensors, including microphones on portable devices, are used to automatically predict social isolation in patients affected by schizophrenia. We empirically show positive results for domain emulation when the training conditions are similar to the target domain. We also show that the Hopfield network outperforms our best BLSTM model for VAD on real-world benchmarks.
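The core idea in the abstract, a teacher scored on clean frames supervising a student that sees a noise-corrupted paired copy of the same frames, can be sketched in a few lines. The snippet below is a minimal illustrative assumption, not the paper's actual RNN/Hopfield architecture: both teacher and student are reduced to tiny logistic frame classifiers, the "emulated noise" is simple additive Gaussian noise, and the student is fit by minimizing cross-entropy against the teacher's soft speech/non-speech posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Paired data: clean frame features and an emulated noisy copy
# (stand-in for the in-domain noise emulation described in the abstract).
n_frames, n_feats = 512, 8
clean = rng.normal(size=(n_frames, n_feats))
noisy = clean + 0.3 * rng.normal(size=(n_frames, n_feats))

# "Teacher": a fixed classifier assumed to be trained on clean data.
w_teacher = rng.normal(size=n_feats)
teacher_post = sigmoid(clean @ w_teacher)  # soft speech posteriors

def bce(p, q):
    # Binary cross-entropy between student posteriors p and soft targets q.
    eps = 1e-9
    return -np.mean(q * np.log(p + eps) + (1.0 - q) * np.log(1.0 - p + eps))

# Student: trained on the noisy copy to match the teacher's posteriors.
w_student = np.zeros(n_feats)
lr = 0.5
losses = []
for _ in range(200):
    p = sigmoid(noisy @ w_student)
    grad = noisy.T @ (p - teacher_post) / n_frames  # gradient of BCE
    w_student -= lr * grad
    losses.append(bce(p, teacher_post))
```

After training, the student's distillation loss on the noisy features should have dropped substantially from its initial value, mirroring how the paper's student adapts to the emulated target-domain conditions without ever seeing hard labels directly.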