InterSpeech 2021

The TAL system for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech

Gaopeng Xu (TAL, China), Song Yang (TAL, China), Lu Ma (TAL, China), Chengfei Li (TAL, China), Zhongqin Wu (TAL, China)

This paper describes TAL’s system for the INTERSPEECH 2021 shared task on Automatic Speech Recognition (ASR) for non-native children’s speech. In this work, we apply a self-supervised approach to ASR for non-native German children’s speech. First, we conduct baseline experiments which indicate that self-supervised learning can capture more acoustic information from non-native children’s speech. Then, we apply 11-fold data augmentation and combine it with data clean-up to supplement the limited training data. Moreover, an in-domain semi-supervised voice activity detection (VAD) model is used to segment untranscribed audio. These strategies significantly improve system performance. Furthermore, we use two types of language model (LM) to improve performance further: a 4-gram LM with CTC beam search and a Transformer LM for second-pass rescoring. Our ASR system reduces the Word Error Rate (WER) by about 48% relative to the baseline and ranked first in the evaluation period with a WER of 23.5%.
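
The abstract reports results as a WER and as a relative WER reduction. As a reminder of how the two quantities relate, here is a minimal sketch in Python. It assumes whitespace tokenization and the standard word-level edit distance; the function names and the example sentences and numbers are hypothetical and are not taken from the paper or from the shared-task scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace; no text normalization
    is applied, unlike in an official scoring pipeline.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline WER removed by the system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical example: one reference word is deleted out of four.
    print(word_error_rate("ich gehe nach hause", "ich gehe hause"))
    # Hypothetical numbers chosen only to illustrate the formula.
    print(relative_wer_reduction(0.50, 0.26))
```

A relative reduction is measured against the baseline WER rather than in absolute percentage points, so a 48% relative reduction means that the system's WER is about 52% of the baseline's.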