Multitask Training with Text Data for End-to-End Speech Recognition (3 minutes introduction)

Peidong Wang (Google, USA), Tara N. Sainath (Google, USA), Ron J. Weiss (Google, USA)

We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder of a Listen, Attend and Spell (LAS) model by training it jointly on audio-text pairs and on text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method requires no additional language model, yields an 11% relative performance improvement over the baseline, and approaches the performance of language model shallow fusion on the test-clean evaluation set. We observe a similar trend on the full 960-hour LibriSpeech training set. Analyses of different error types and of sample output sentences show that the proposed method can incorporate language-level information, which suggests it may be effective in real-world applications.
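
The idea of training one decoder on two kinds of data can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weighted-sum form of the objective, and the `text_weight` parameter are all assumptions made for illustration. The abstract does not say how the decoder is fed when no audio is available, so that part is left out here; the sketch only shows how a loss on audio-text pairs and a loss on text-only data might be combined into a single training objective.

```python
import math

def cross_entropy(step_probs, targets):
    """Mean negative log-likelihood of the target token ids.

    step_probs: one probability distribution (list of floats) per output step.
    targets:    the reference token id for each step.
    """
    nll = -sum(math.log(p[t]) for p, t in zip(step_probs, targets))
    return nll / len(targets)

def multitask_loss(paired_probs, paired_targets,
                   text_probs, text_targets, text_weight=0.5):
    """Combine the two decoder losses (hypothetical weighting scheme).

    paired_*: decoder outputs and references for an audio-text batch.
    text_*:   decoder outputs and references for a text-only batch.
    text_weight is an assumed hyperparameter, not a value from the paper.
    """
    asr_loss = cross_entropy(paired_probs, paired_targets)
    lm_loss = cross_entropy(text_probs, text_targets)
    return asr_loss + text_weight * lm_loss

# Toy usage with a two-token vocabulary and one decoding step per batch.
loss = multitask_loss([[0.5, 0.5]], [0], [[0.25, 0.75]], [0])
```

In a real system the two probability lists would come from the same decoder network, so gradients from the text-only term act as a regularizer on the decoder's parameters.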

InterSpeech 2021

Related Recordings

Regularizing Word Segmentation by Creating Misspellings
(3 minutes introduction)

Emitting Word Timings with HMM-free End-to-End System in Automatic Speech Recognition
(3 minutes introduction)
