|Chi-Hang Leong (NYCU, Taiwan), Yu-Han Huang (NYCU, Taiwan), Jen-Tzung Chien (NYCU, Taiwan)|
Traditionally, the transformer with connectionist temporal classification (CTC) was developed for offline speech recognition, where the transcription is generated only after the whole utterance has been spoken. However, many applications, including live broadcasting and meetings, require online transcription of the speech signal. This paper presents an online transformer for real-time speech recognition in which the transcription is generated chunk by chunk. In particular, an online compressive transformer (OCT) is proposed for end-to-end speech recognition. OCT aims to produce an immediate transcription for each audio chunk while still achieving performance comparable to offline speech recognition. In the implementation, OCT is tightly combined with both CTC and the recurrent neural network transducer, and is trained by minimizing their losses. In addition, OCT incorporates a compressive memory to reduce the performance degradation that online processing can cause, which arises because each chunk is transcribed without history information. Experiments on speech recognition show that OCT not only obtains performance comparable to the offline transformer but also runs faster than the baseline model.
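
The chunk-by-chunk processing with a compressive memory described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the compression function, so mean pooling is used here purely as an assumption, and the names `split_into_chunks`, `compress`, `stream_with_memory`, `rate`, and `max_memory` are hypothetical. The sketch only shows how each chunk could be given a compact summary of earlier audio instead of being processed with no history.

```python
from typing import Iterator, List, Tuple

Frame = List[float]  # one acoustic feature vector


def split_into_chunks(frames: List[Frame], chunk_size: int) -> List[List[Frame]]:
    """Split an utterance into consecutive, non-overlapping chunks."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def compress(frames: List[Frame], rate: int) -> List[Frame]:
    """Mean-pool every `rate` consecutive frames into one memory slot.

    Mean pooling is an assumption made for illustration only.
    """
    slots = []
    for i in range(0, len(frames), rate):
        group = frames[i:i + rate]
        dim = len(group[0])
        slots.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return slots


def stream_with_memory(
    frames: List[Frame], chunk_size: int, rate: int, max_memory: int
) -> Iterator[Tuple[List[Frame], List[Frame]]]:
    """Yield (chunk, context) pairs, where context = compressed history + chunk."""
    memory: List[Frame] = []
    for chunk in split_into_chunks(frames, chunk_size):
        # A model would attend over `context` to transcribe this chunk.
        yield chunk, memory + chunk
        # After the chunk is processed, fold it into the bounded memory.
        memory = memory + compress(chunk, rate)
        if len(memory) > max_memory:
            memory = memory[len(memory) - max_memory:]


# Toy input: eight one-dimensional frames with values 0.0 .. 7.0.
frames = [[float(i)] for i in range(8)]
steps = list(stream_with_memory(frames, chunk_size=4, rate=2, max_memory=3))
for chunk, context in steps:
    print(len(chunk), len(context))
```

In this toy run the first chunk has no history, while the second chunk sees two compressed slots summarizing the first chunk followed by its own four frames, so the context grows far more slowly than the raw audio.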