Application of Convolutional Neural Networks to Language Identification in Noisy Conditions

Yun Lei, Luciana Ferrer, Aaron Lawson, Mitchell McLaren and Nicolas Scheffer

This paper proposes two novel frontends for robust language identification (LID) using a convolutional neural network (CNN) trained for automatic speech recognition (ASR). In the CNN/i-vector frontend, the CNN, rather than a universal background model (UBM), supplies the posterior probabilities used for i-vector training and extraction. The CNN/posterior frontend resembles a phonetic system in that the occupation counts of tied triphone states (senones) given by the CNN are used for classification; these counts are compressed to a low-dimensional vector using probabilistic principal component analysis (PPCA). Evaluated on heavily degraded speech data, the proposed frontends provide significant improvements of up to 50% on average equal error rate compared to a UBM/i-vector baseline. Moreover, the proposed frontends are complementary and, when combined, give significant gains of up to 20% relative to the best single system.
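As a rough illustration of the statistics both frontends rely on, the sketch below accumulates per-senone occupation counts and first-order statistics from frame-level posteriors. It is not the authors' implementation. The function names, the pure-Python list layout (T frames by K senones, T frames by D feature dimensions), and the count normalization are assumptions made for this example. The CNN itself, the i-vector extractor, and the PPCA compression step are omitted.

```python
def sufficient_stats(posteriors, features):
    """Accumulate zeroth- and first-order statistics for one utterance.

    posteriors: list of T frames, each a list of K senone posteriors
                (assumed here to come from a CNN instead of a UBM).
    features:   list of T frames, each a list of D acoustic features.
    Returns (counts, first), where counts[k] is the occupation count of
    senone k and first[k] is the posterior-weighted feature sum.
    """
    num_senones = len(posteriors[0])
    dim = len(features[0])
    counts = [0.0] * num_senones
    first = [[0.0] * dim for _ in range(num_senones)]
    for gamma, x in zip(posteriors, features):
        for k, g in enumerate(gamma):
            counts[k] += g
            for d, value in enumerate(x):
                first[k][d] += g * value
    return counts, first


def normalized_counts(counts):
    """Scale occupation counts to sum to one (an assumed preprocessing
    step for this sketch, so utterance length does not dominate)."""
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    # Toy input: 2 frames, 2 senones, 2-dimensional features.
    post = [[0.5, 0.5], [1.0, 0.0]]
    feats = [[1.0, 2.0], [3.0, 4.0]]
    counts, first = sufficient_stats(post, feats)
    print(counts, first, normalized_counts(counts))
```

In a CNN/i-vector style pipeline, the counts and first-order statistics would feed i-vector training and extraction; in a CNN/posterior style pipeline, only the counts vector would be passed on to dimensionality reduction and classification.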

Odyssey 2014

The Speaker and Language Recognition Workshop
