|Ma Jin, Yan Song, Ian McLoughlin, Lirong Dai, Zhongfu Ye|
A key problem in spoken language identification (LID) is how to effectively model features from a given speech utterance. Recent techniques, such as end-to-end schemes and deep neural networks (DNNs) utilising transfer learning via bottleneck (BN) features, have demonstrated good overall performance but have not addressed the extraction of LID-specific features. We thus propose a novel end-to-end neural network that aims to obtain effective LID-senone representations, which we define as being analogous to senones in speech recognition. We show that LID-senones combine a compact representation of the original acoustic feature space with powerful descriptive and discriminative capability. Furthermore, a novel incremental training method is adopted to extract the weak language information buried in the acoustic features when language resources are insufficient. Results on the six most confusable languages in NIST LRE 2009 show good performance compared to state-of-the-art BN-GMM/i-vector and BN-DNN/i-vector systems. The proposed end-to-end network, coupled with an incremental training method that mitigates over-fitting, has potential not just for LID but also for other resource-constrained tasks.