Reducing Exposure Bias in Training Recurrent Neural Network Transducers (3 minutes introduction)

Xiaodong Cui (IBM, USA), Brian Kingsbury (IBM, USA), George Saon (IBM, USA), David Haws (IBM, USA), Zoltán Tüske (IBM, USA)

When recurrent neural network transducers (RNNTs) are trained using the typical maximum likelihood criterion, the prediction network is trained only on ground truth label sequences. This leads to a mismatch during inference, known as exposure bias, when the model must deal with label sequences containing errors. In this paper we investigate approaches to reducing exposure bias in training to improve the generalization of RNNT models for automatic speech recognition (ASR). A label-preserving input perturbation to the prediction network is introduced. The input token sequences are perturbed using SwitchOut and scheduled sampling based on an additional token language model. Experiments conducted on the 300-hour Switchboard dataset demonstrate their effectiveness. By reducing the exposure bias, we show that we can further improve the accuracy of a high-performance RNNT ASR model and obtain state-of-the-art results on the 300-hour Switchboard dataset.
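The idea of a label-preserving input perturbation can be illustrated with a minimal sketch. It is not the paper's procedure: the function name `perturb_prediction_input`, the parameters `vocab_size` and `replace_prob`, and the uniform per-token replacement rule are all assumptions made for illustration. It stands in for the SwitchOut-style perturbation and does not model the scheduled sampling variant, which would draw replacements from a token language model. The point it shows is that only the token sequence fed to the prediction network is corrupted, while the target labels used in the loss stay unchanged.

```python
import random


def perturb_prediction_input(tokens, vocab_size, replace_prob, rng=None):
    """Return a copy of `tokens` with some ids swapped for random vocabulary ids.

    Hypothetical helper: each token is replaced independently with probability
    `replace_prob` by a different id drawn uniformly from [0, vocab_size).
    Assumes vocab_size >= 2 and every token id lies in [0, vocab_size).
    """
    rng = rng or random.Random()
    perturbed = []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Draw from the other vocab_size - 1 ids, skipping the original id.
            new_tok = rng.randrange(vocab_size - 1)
            if new_tok >= tok:
                new_tok += 1
            perturbed.append(new_tok)
        else:
            perturbed.append(tok)
    return perturbed


# Ground-truth label sequence (made-up token ids for illustration).
labels = [5, 12, 7, 3]

# Only the prediction-network input is perturbed; the training loss would
# still be computed against the unmodified `labels`.
pred_net_input = perturb_prediction_input(
    labels, vocab_size=50, replace_prob=0.2, rng=random.Random(0)
)
```

In a training loop, `pred_net_input` would replace the clean label history given to the prediction network, so the model sees label sequences containing errors during training as it does at inference time.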

Related Recordings

An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling
(3 minutes introduction)

Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li, Qiao Liang, Pat Rondon

Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models
(3 minutes introduction)

Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

InterSpeech 2021
