InterSpeech 2021

Emitting Word Timings with HMM-free End-to-End System in Automatic Speech Recognition
(3-minute introduction)

Xianzhao Chen (Tianjin University, China), Hao Ni (ByteDance, China), Yi He (ByteDance, China), Kang Wang (ByteDance, China), Zejun Ma (ByteDance, China), Zongxia Xie (Tianjin University, China)
Word timings, which mark the start and end times of each word in ASR results, play an important part in many applications, such as computer-assisted language learning. To date, end-to-end (E2E) systems outperform conventional DNN-HMM hybrid systems in ASR accuracy but still struggle to produce accurate word timings. In this paper, we propose a two-pass method to estimate word timings under an E2E-based LAS modeling framework that does not rely on a DNN-HMM ASR system at all. Specifically, we first employ the LAS system to obtain word-piece transcripts of the input audio; we then compute forced alignments with a frame-level word-piece classifier. To make the classifier yield accurate word-piece timings, we propose a novel objective function for training the classifier that utilizes the spike timings of a connectionist temporal classification (CTC) model. On LibriSpeech data, our E2E-based LAS system achieves WERs of 2.8%/7.0%, while its word timing (start/end) accuracies are 99.0%/95.3% and 98.6%/93.7% on the test-clean and test-other test sets, respectively. Compared with a DNN-HMM hybrid ASR system (here, TDNN), the LAS system is better in ASR performance, and the generated word timings are close to those produced by the TDNN ASR system.
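
The second pass described above (forced alignment of a known word-piece sequence against frame-level classifier outputs) can be illustrated with a minimal Viterbi sketch. This is not the authors' implementation: it assumes a plain monotonic alignment with no blank symbol, where every frame belongs to exactly one word-piece and every word-piece covers at least one frame. The function names `forced_align` and `spans_to_times` and the `frame_shift_s` parameter are hypothetical, and the proposed CTC-spike-based training objective is not shown.

```python
import math


def forced_align(log_probs, tokens):
    """Monotonic Viterbi alignment of a word-piece sequence to frames.

    log_probs[t][v] is the frame-level log-probability of word-piece id v
    at frame t (assumed to come from the frame-level classifier).
    Returns one (start_frame, end_frame) pair per token, end inclusive.
    """
    T, N = len(log_probs), len(tokens)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per token")
    NEG = -math.inf
    # score[t][n]: best log-score of any path that assigns frame t to token n
    score = [[NEG] * N for _ in range(T)]
    # back[t][n]: 1 if the best path entered token n at frame t, else 0
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_probs[0][tokens[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else NEG
            if move > stay:
                best, back[t][n] = move, 1
            else:
                best, back[t][n] = stay, 0
            if best > NEG:
                score[t][n] = best + log_probs[t][tokens[n]]
    if score[T - 1][N - 1] == NEG:
        raise ValueError("no valid alignment found")
    # Backtrace from the last frame of the last token to frame 0
    spans = [[None, None] for _ in range(N)]
    n = N - 1
    spans[n][1] = T - 1
    for t in range(T - 1, 0, -1):
        if back[t][n] == 1:
            spans[n][0] = t       # token n starts at frame t
            n -= 1
            spans[n][1] = t - 1   # previous token ends one frame earlier
    spans[0][0] = 0
    return [tuple(s) for s in spans]


def spans_to_times(spans, frame_shift_s):
    """Convert inclusive frame spans to (start, end) times in seconds."""
    return [(s * frame_shift_s, (e + 1) * frame_shift_s) for s, e in spans]
```

For example, with five frames whose classifier posteriors favor word-piece 1 in the first two frames and word-piece 2 in the remaining three, `forced_align(log_probs, [1, 2])` returns `[(0, 1), (2, 4)]`. Word-level timings would then follow by merging the spans of the word-pieces that make up each word, a step omitted here.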