The Application of Learnable STRF Kernels to the 2021 Fearless Steps Phase-03 SAD Challenge (Oral presentation)

Tyler Vuong (Carnegie Mellon University, USA), Yangyang Xia (Carnegie Mellon University, USA), Richard M. Stern (Carnegie Mellon University, USA)

We describe a deep-learning-based system developed for the Fearless Steps Phase-03 Speech Activity Detection (SAD) challenge. The first layer of the system combines learnable spectro-temporal receptive fields (STRFs) with unconstrained two-dimensional convolutional kernels. Experiments show that including learnable STRFs in the first layer increases the system’s robustness to additive noise. We also found that applying SpecAugment during training improves generalization to unseen data. By incorporating these enhancements and others, our system achieved the best score in the official SAD challenge.
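As a rough illustration of the SpecAugment-style masking mentioned in the abstract, the sketch below zeroes one random frequency band and one random time span of a spectrogram. It is a minimal sketch, not the authors' implementation: the function name `spec_augment`, the mask widths, the zero fill value, and the list-of-lists spectrogram layout are all assumptions made for illustration, and the time-warping step of SpecAugment is omitted.

```python
import random


def spec_augment(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Return a masked copy of `spec`.

    `spec` is a list of frequency rows, each a list of time frames
    (hypothetical layout). Mask widths are illustrative defaults,
    not values taken from the abstract.
    """
    rng = rng or random.Random()
    n_freq = len(spec)
    n_time = len(spec[0]) if n_freq else 0
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: zero f consecutive frequency bins starting at f0.
    f = rng.randint(0, min(max_freq_width, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: zero t consecutive frames starting at t0 in every row.
    t = rng.randint(0, min(max_time_width, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        row[t0:t0 + t] = [0.0] * t

    return out


# Usage: mask a dummy 40-bin by 50-frame spectrogram with a fixed seed.
dummy = [[1.0] * 50 for _ in range(40)]
masked = spec_augment(dummy, rng=random.Random(0))
```

In training, a transform like this would typically be applied to each input spectrogram on the fly, so the model sees a differently masked copy every epoch.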

Related Recordings

Unsupervised Representation Learning for Speech Activity Detection in the Fearless Steps Challenge 2021
(Oral presentation)

Pablo Gimeno, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

Speech Activity Detection Based on Multilingual Speech Recognition System
(Oral presentation)

Seyyed Saeed Sarfjoo, Srikanth Madikeri, Petr Motlicek

InterSpeech 2021
