I-Vectors for speech activity detection

Elie Khoury, Matt Garland

I-vectors are low-dimensional front-end features known to effectively preserve the total variability of the signal. Motivated by their successful use in several classification problems, such as speaker, language and face recognition, this paper introduces i-vectors for the task of speech activity detection (SAD). In contrast to most state-of-the-art SAD methods, which operate at the frame or segment level, this paper proposes a cluster-based SAD approach, for which two algorithms were investigated: the first uses the generalized likelihood ratio (GLR) and the Bayesian information criterion (BIC) for segmentation and clustering, whereas the second uses K-means and GMM clustering. Furthermore, we explore i-vectors based on different low-level features, including MFCC, PLP and RASTA-PLP, as well as the fusion of such systems at the decision level. We show the feasibility and effectiveness of the proposed system, in comparison with a frame-based GMM baseline, on the challenging RATS dataset in the context of the 2015 NIST OpenSAD evaluation.
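The first algorithm relies on GLR and BIC. As a point of reference, the sketch below shows the standard Delta-BIC test between two adjacent windows of feature frames (for example MFCCs), each modeled by a single full-covariance Gaussian. The GLR term measures how much the likelihood improves when two Gaussians are used instead of one, and the BIC penalty (weighted by a tunable factor `lam`) discounts the extra parameters. A positive value suggests a change point; during clustering, a negative value between two clusters suggests merging them. This is a minimal sketch, not the authors' implementation: the function names `logdet_cov`, `glr_distance` and `delta_bic`, the parameter `lam`, and the synthetic data are all hypothetical.

```python
import numpy as np


def logdet_cov(x: np.ndarray) -> float:
    """Log-determinant of the maximum-likelihood covariance of frames x (N x d)."""
    cov = np.cov(x, rowvar=False, bias=True)  # bias=True gives the ML estimate
    sign, logdet = np.linalg.slogdet(np.atleast_2d(cov))
    if sign <= 0:
        raise ValueError("covariance is singular; use more frames or fewer dims")
    return logdet


def glr_distance(x1: np.ndarray, x2: np.ndarray) -> float:
    """GLR term: log-likelihood gain of two Gaussians (x1, x2) over one pooled Gaussian."""
    x = np.vstack([x1, x2])
    n, n1, n2 = len(x), len(x1), len(x2)
    return 0.5 * (n * logdet_cov(x) - n1 * logdet_cov(x1) - n2 * logdet_cov(x2))


def delta_bic(x1: np.ndarray, x2: np.ndarray, lam: float = 1.0) -> float:
    """Delta-BIC: GLR minus a complexity penalty; > 0 suggests two distinct sources."""
    n, d = len(x1) + len(x2), x1.shape[1]
    n_params = d + d * (d + 1) / 2  # extra mean vector + extra full covariance
    penalty = 0.5 * lam * n_params * np.log(n)
    return glr_distance(x1, x2) - penalty


if __name__ == "__main__":
    # Hypothetical 13-dim frames: two windows from one source, one from a shifted source.
    rng = np.random.default_rng(0)
    same_1 = rng.normal(0.0, 1.0, size=(300, 13))
    same_2 = rng.normal(0.0, 1.0, size=(300, 13))
    shifted = rng.normal(3.0, 1.0, size=(300, 13))
    print(delta_bic(same_1, same_2) < 0, delta_bic(same_1, shifted) > 0)
```

Using a single full-covariance Gaussian per window keeps the test cheap and closed-form; `lam` is usually tuned on development data to trade missed changes against false ones.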