|Haibin Yu (Tsinghua University, China), Jing Zhao (Tsinghua University, China), Song Yang (TAL, China), Zhongqin Wu (TAL, China), Yuting Nie (Tsinghua University, China), Wei-Qiang Zhang (Tsinghua University, China)|
Unsupervised pretrained models have been shown to rival or even outperform supervised systems in various speech recognition tasks. However, their performance on language recognition remains largely unexplored. In this paper, we construct several language recognition systems based on existing unsupervised pretraining approaches and evaluate their ability to learn high-level, generalizable representations of language. We find that unsupervised pretrained models capture expressive and highly linearly separable features. With these representations, language recognition performs well even when the classifier is relatively simple or only a small amount of labeled data is available. Although linear classifiers are usable, neural networks with RNN structures further improve the results. At the same time, unsupervised pretrained models produce fine-grained frame-level representations that are strongly coupled with the acoustic properties of the input sequence. These features therefore retain redundant speaker and channel information that bears little relation to language identity. This property of unsupervised pretrained models causes performance degradation on cross-channel language recognition tests.
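The claim that pretrained representations are linearly separable suggests a simple recipe: mean-pool frame-level features into an utterance vector, then fit a linear classifier. A minimal sketch of that pipeline, using synthetic stand-in features rather than an actual pretrained model (the dimensions, offsets, and pooling choice here are illustrative assumptions, not the paper's setup):

```python
# Sketch: linear language classifier over pooled frame-level features.
# The synthetic features stand in for wav2vec-style pretrained outputs;
# each language occupies a different region of feature space, mimicking
# the linear separability reported for pretrained representations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def pool_utterance(frames):
    """Mean-pool frame-level features (T, D) into one utterance vector (D,)."""
    return frames.mean(axis=0)

D, n_per_lang = 32, 50
langs = {0: 0.0, 1: 2.0, 2: -2.0}  # language id -> feature-space offset (assumed)
X, y = [], []
for lang, offset in langs.items():
    for _ in range(n_per_lang):
        T = int(rng.integers(50, 200))          # variable utterance length
        frames = rng.normal(offset, 1.0, (T, D))
        X.append(pool_utterance(frames))
        y.append(lang)
X, y = np.stack(X), np.array(y)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {clf.score(X, y):.2f}")
```

In practice the frame-level features would come from a pretrained encoder; the point of the sketch is only that, once features are linearly separable, a logistic regression over pooled vectors already suffices.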