Evaluation of an LSTM-RNN System in Different NIST Language Recognition Frameworks
Ruben Zazo, Alicia Lozano-Diez, Joaquin Gonzalez-Rodriguez
Long Short-Term Memory recurrent neural networks (LSTM RNNs) provide outstanding performance in language identification (LID) due to their ability to model speech sequences. So far, published LSTM RNN solutions for LID have dealt with highly controlled scenarios, balanced datasets and limited channel variability. In this paper, we evaluate an end-to-end LSTM LID system, comparing it against a classical i-vector system, in different environments based on data from the Language Recognition Evaluations (LRE) organized by NIST. In order to analyze its behavior, we train and test our system on a balanced and controlled subset of LRE09, on the development data of LRE15 and, finally, on the evaluation set of LRE15. Our results show that an end-to-end recurrent system clearly outperforms the reference i-vector system in a controlled environment, especially when dealing with short utterances. Nevertheless, our deep learning approach is more sensitive to unbalanced datasets, channel variability and, especially, to the mismatch between development and test datasets.
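To make the end-to-end idea concrete, the following is a minimal NumPy sketch of an LSTM-based LID classifier: an LSTM runs over a sequence of acoustic feature frames and a softmax over the final hidden state yields per-language posteriors. All dimensions, the random parameters, and the use of only the last hidden state are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gates stacked in order: input, forget, cell, output."""
    H = h.size
    z = W @ x + U @ h + b
    i = sigmoid(z[0*H:1*H])        # input gate
    f = sigmoid(z[1*H:2*H])        # forget gate
    g = np.tanh(z[2*H:3*H])        # candidate cell update
    o = sigmoid(z[3*H:4*H])        # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def lid_forward(frames, params):
    """Run the LSTM over all frames, then classify from the last hidden state."""
    W, U, b, V = params
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in frames:
        h, c = lstm_step(x, h, c, W, U, b)
    logits = V @ h                      # one logit per target language
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

# Illustrative sizes: 20-dim acoustic features, 32 hidden units, 6 languages.
rng = np.random.default_rng(0)
D, H, L = 20, 32, 6
params = (rng.standard_normal((4 * H, D)) * 0.1,   # input weights W
          rng.standard_normal((4 * H, H)) * 0.1,   # recurrent weights U
          np.zeros(4 * H),                         # gate biases b
          rng.standard_normal((L, H)) * 0.1)       # output projection V

utterance = rng.standard_normal((100, D))  # 100 frames of a synthetic utterance
probs = lid_forward(utterance, params)     # posterior over the 6 languages
```

In a trained system the parameters would of course be learned end-to-end from labeled utterances; the sketch only shows the forward pass that maps a variable-length frame sequence to a fixed set of language posteriors.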