InterSpeech 2021

End to end transformer-based contextual speech recognition based on pointer network
(3-minute introduction)

Binghuai Lin (Tencent, China), Liyuan Wang (Tencent, China)
Most spoken language assessment systems rely on text features extracted from automatic speech recognition (ASR) transcripts and thus depend heavily on the accuracy of the ASR system. Automatic speech scoring tasks such as reading aloud and spontaneous speech commonly come with prompts, provided in advance to guide test takers' answers, that contain information expected to appear in those answers (e.g., a listening passage or a sample response). Using these texts to improve ASR performance is therefore of great importance for these tasks. In this paper, we develop an end-to-end (E2E) ASR system that incorporates the contextual information provided by prompts. Specifically, we add an extra prompt encoder to a transformer-based E2E ASR system. To fuse the probabilities of the ASR output and the prompts dynamically, we train a soft gate based on the pointer network with a carefully constructed prompt training corpus. We evaluate the proposed method on data collected from English speaking proficiency tests taken by Chinese teenagers aged 16 to 18. The results show improved speech recognition performance, with a nearly 50% drop in word error rate (WER) when prompts are used. Furthermore, the proposed network performs well in recognizing rare words such as locations and personal names.
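
The soft-gate fusion mentioned in the abstract can be illustrated with a minimal sketch in the style of a generic pointer-generator mixture. This is not the authors' implementation: the function name `fuse_with_prompt`, the dictionary-based distributions, the fixed scalar `gate`, and the choice to let `gate` weight the ASR distribution are all assumptions made for illustration. In a trained model the gate and the attention weights over prompt tokens would be predicted at each decoding step.

```python
from typing import Dict, Sequence


def fuse_with_prompt(
    p_asr: Dict[str, float],
    prompt_tokens: Sequence[str],
    attention: Sequence[float],
    gate: float,
) -> Dict[str, float]:
    """Mix an ASR vocabulary distribution with a copy distribution over a prompt.

    Hypothetical helper: `gate` is the weight given to the ASR distribution,
    and (1 - gate) is the weight given to copying from the prompt.
    """
    if len(prompt_tokens) != len(attention):
        raise ValueError("one attention weight is required per prompt token")
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")

    # Copy distribution: sum the attention mass of every position
    # where the same token occurs in the prompt.
    p_copy: Dict[str, float] = {}
    for token, weight in zip(prompt_tokens, attention):
        p_copy[token] = p_copy.get(token, 0.0) + weight

    # Convex combination of the two distributions over the union of tokens.
    fused: Dict[str, float] = {}
    for token in set(p_asr) | set(p_copy):
        fused[token] = gate * p_asr.get(token, 0.0) + (1.0 - gate) * p_copy.get(token, 0.0)
    return fused


# Toy numbers, invented for illustration only.
p_asr = {"paris": 0.1, "parents": 0.6, "the": 0.3}
prompt_tokens = ["the", "trip", "to", "paris"]
attention = [0.1, 0.1, 0.1, 0.7]

fused = fuse_with_prompt(p_asr, prompt_tokens, attention, gate=0.4)
best = max(fused, key=fused.get)
```

With these toy numbers the acoustic model alone prefers "parents", while the fused distribution prefers "paris" because the prompt attention concentrates on it. This is the kind of effect that could help with rare words such as locations and personal names.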