|Xiaoshuo Xu (Tencent, China), Yueteng Kang (Tencent, China), Songjun Cao (Tencent, China), Binghuai Lin (Tencent, China), Long Ma (Tencent, China)|
This paper presents an initial attempt to use self-supervised learning for Mispronunciation Detection. Unlike existing methods, which train models on speech recognition corpora, we exploit unlabeled data and pretrain with a self-supervised learning technique, Wav2vec 2.0. After pretraining, only a small amount of pronunciation-labeled data is needed for finetuning. Formulating Mispronunciation Detection as a binary classification task, we add convolutional and pooling layers on top of the pretrained model to detect mispronunciations of the given prompted texts within the segments produced by alignment. The training process is simple and effective. Several experiments are conducted to validate the effectiveness of the pretraining method. Our approach outperforms existing methods on the public L2-ARCTIC dataset, with an F1 score of 0.610.
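The classification step described in the abstract (convolution and pooling over the frames of one aligned segment, followed by a binary decision) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the single output channel, the ReLU, mean pooling, the sigmoid output, and the toy weights are all assumptions made for illustration. In practice the frame features would come from a pretrained Wav2vec 2.0 encoder and the layers would be trained in a deep learning framework.

```python
import math
from typing import List, Tuple

Frame = List[float]  # one feature vector per time step (stand-in for encoder output)


def conv1d_relu(frames: List[Frame], kernel: List[List[float]], bias: float) -> List[float]:
    """Valid 1-D convolution over time with one output channel, followed by ReLU.

    kernel[k][d] weights feature d of the frame at time offset k.
    """
    width = len(kernel)
    if len(frames) < width:
        raise ValueError("segment is shorter than the kernel width")
    activations = []
    for t in range(len(frames) - width + 1):
        acc = bias
        for k in range(width):
            acc += sum(w * x for w, x in zip(kernel[k], frames[t + k]))
        activations.append(max(acc, 0.0))
    return activations


def score_segment(
    frames: List[Frame],
    segment: Tuple[int, int],
    kernel: List[List[float]],
    bias: float,
    out_weight: float,
    out_bias: float,
) -> float:
    """Return a hypothetical mispronunciation probability for one aligned segment.

    `segment` is a (start, end) frame range with `end` exclusive, assumed to
    come from a forced alignment of the prompted text.
    """
    start, end = segment
    activations = conv1d_relu(frames[start:end], kernel, bias)
    pooled = sum(activations) / len(activations)  # mean pooling over time
    logit = out_weight * pooled + out_bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid for binary classification


# Toy input: six 2-dimensional frames and two aligned segments (assumed values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # kernel width 2, feature dimension 2

p_first = score_segment(frames, (0, 3), kernel, 0.0, 1.0, -1.5)
p_second = score_segment(frames, (3, 6), kernel, 0.0, 1.0, -1.5)
```

Each segment yields one probability, which can be thresholded (for example at 0.5, an assumed value) to obtain the binary labels from which precision, recall, and F1 are computed.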