Explore Wav2vec 2.0 for Mispronunciation Detection (3 minutes introduction)

Xiaoshuo Xu (Tencent, China), Yueteng Kang (Tencent, China), Songjun Cao (Tencent, China), Binghuai Lin (Tencent, China), Long Ma (Tencent, China)

This paper presents an initial attempt to use self-supervised learning for Mispronunciation Detection. Unlike existing methods that train models on speech recognition corpora, we exploit unlabeled data and use a self-supervised learning technique, Wav2vec 2.0, for pretraining. After pretraining, only a small amount of pronunciation-labeled data is needed for finetuning. Formulating Mispronunciation Detection as a binary classification task, we add convolutional and pooling layers on top of the pretrained model to detect mispronunciations of the given prompted texts within the aligned segments. The training process is simple and effective. Several experiments are conducted to validate the effectiveness of the pretraining method. Our approach outperforms existing methods on the public L2-ARCTIC dataset, with an F1 score of 0.610.
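The classification head described in the abstract (convolution and pooling over encoder frames, one binary decision per aligned segment) could look roughly like the following NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the ReLU activation, mean pooling, kernel size, hidden width, random weights, and 0.5 threshold are all hypothetical, and the random `frames` array merely stands in for the output of a pretrained Wav2vec 2.0 encoder.

```python
import numpy as np


def conv1d_same(frames, kernel, bias):
    """1-D convolution over time with zero padding, keeping the frame count.

    frames: (T, D_in), kernel: (K, D_in, D_out), bias: (D_out,)
    """
    k = kernel.shape[0]
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = np.pad(frames, ((pad_left, pad_right), (0, 0)))
    # Stack K shifted views so every output frame sees its K-frame window.
    windows = np.stack(
        [padded[i:i + frames.shape[0]] for i in range(k)], axis=1
    )  # (T, K, D_in)
    return np.einsum("tkd,kdo->to", windows, kernel) + bias


def score_segments(frames, segments, params):
    """Return one mispronunciation probability per (start, end) frame segment."""
    hidden = conv1d_same(frames, params["kernel"], params["conv_bias"])
    hidden = np.maximum(hidden, 0.0)  # ReLU (assumed activation)
    probs = []
    for start, end in segments:
        pooled = hidden[start:end].mean(axis=0)  # mean pooling (assumed)
        logit = pooled @ params["out_weight"] + params["out_bias"]
        probs.append(1.0 / (1.0 + np.exp(-logit)))  # sigmoid
    return np.array(probs)


def init_params(d_in, d_hidden, kernel_size, rng):
    """Random, untrained weights; in practice these are learned in finetuning."""
    scale = 1.0 / np.sqrt(kernel_size * d_in)
    return {
        "kernel": rng.standard_normal((kernel_size, d_in, d_hidden)) * scale,
        "conv_bias": np.zeros(d_hidden),
        "out_weight": rng.standard_normal(d_hidden) / np.sqrt(d_hidden),
        "out_bias": 0.0,
    }


rng = np.random.default_rng(0)
# Placeholder for encoder output: 50 frames of 768-dimensional features.
frames = rng.standard_normal((50, 768))
# Hypothetical forced-alignment result: one frame range per prompted phone.
segments = [(0, 12), (12, 30), (30, 50)]
params = init_params(d_in=768, d_hidden=64, kernel_size=3, rng=rng)

probs = score_segments(frames, segments, params)
flags = probs > 0.5  # assumed decision threshold
```

Because the weights here are random, the resulting probabilities carry no meaning; the sketch only shows how frame-level features and alignment boundaries combine into one binary decision per segment.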

InterSpeech 2021
