Trung Ngo Trong, Ville Hautamäki, Kong Aik Lee
This work explores the use of various Deep Neural Network (DNN) architectures for an end-to-end language identification (LID) task. Deep learning has been shown to significantly improve the state of the art in many domains, including speech recognition, computer vision, and genomics. As an end-to-end system, deep learning removes the burden of hand-crafting the feature extraction that conventional approaches to LID require. This versatility is achieved by training a very deep network to learn distributed representations of speech features with multiple levels of abstraction. In this paper, we show that an end-to-end deep learning system can be used to recognize language from speech utterances of varying lengths. Our results show that a combination of three deep architectures (feed-forward, convolutional, and recurrent networks) achieves the best performance compared to the other network designs. Additionally, we compare our network's performance to a state-of-the-art BNF-based i-vector system on the NIST 2015 Language Recognition Evaluation corpus. Key to our approach is that we effectively address computational and regularization issues within the network structure, which allows us to build a deeper architecture than previous DNN approaches to the language recognition task.