Deep feature transfer learning for automatic pronunciation assessment (3 minutes introduction)

Binghuai Lin (Tencent, China), Liyuan Wang (Tencent, China)

Automatic pronunciation assessment is commonly used to evaluate the pronunciation quality of second language (L2) learners. Traditional methods normally rely on speech features such as Goodness of Pronunciation (GOP), which may not provide sufficient information for assessing pronunciation proficiency [1]. In this paper, we propose a transfer learning method for automatic pronunciation assessment. Instead of traditional features such as GOP, we directly use the deep features from the acoustic model and transfer the acoustic knowledge from ASR to a dedicated scoring module. The scoring module uses an attention mechanism to model the relationship among different granularities in an utterance. Only this module is updated, which allows faster transfer and adaptation to various pronunciation assessment tasks. Experimental results on a dataset recorded by Chinese English-as-a-second-language (ESL) learners and on the Speechocean762 dataset show that the proposed method outperforms traditional GOP-based baselines in Pearson correlation coefficient (PCC) and yields parameter-efficient transfer across different pronunciation assessment tasks.
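The abstract does not specify the scoring module's architecture, so the following is only a minimal sketch of the two generic ingredients it names: attention-based pooling over a sequence of feature vectors, and the Pearson correlation coefficient used for evaluation. The function names, the single `query` vector, and the toy inputs are all illustrative assumptions, not the authors' implementation. The feature vectors stand in for outputs of a frozen acoustic model, and the `query` stands in for the trainable part.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Pool a sequence of feature vectors into a single vector.

    features: list of equal-length vectors (assumed here to be deep
              features taken from a frozen acoustic model).
    query:    hypothetical learned vector; in this sketch it is the only
              parameter that a scoring module would update.
    Returns (pooled_vector, attention_weights).
    """
    # Dot-product score of each feature vector against the query.
    scores = [sum(f * q for f, q in zip(vec, query)) for vec in features]
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum of the feature vectors, one dimension at a time.
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, features))
        for d in range(dim)
    ]
    return pooled, weights


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Toy phone-level features; a zero query gives uniform attention.
    pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(pooled, weights)
    # Toy predicted scores versus toy human scores.
    print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In a real system the pooled vector would feed a regression layer that predicts the pronunciation score, and PCC would be computed between predicted and human-annotated scores over a test set.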

InterSpeech 2021
