Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition (Oral presentation)

Nanxin Chen (Johns Hopkins University, USA), Piotr Żelasko (Johns Hopkins University, USA), Laureano Moro-Velázquez (Johns Hopkins University, USA), Jesús Villalba (Johns Hopkins University, USA), Najim Dehak (Johns Hopkins University, USA)

Deep autoregressive models are becoming comparable or superior to conventional systems for automatic speech recognition. However, their inference remains slow because they decode token by token. Non-autoregressive models greatly improve decoding speed by decoding within a constant number of iterations. For example, Align-Refine was proposed to improve the performance of non-autoregressive systems by iteratively refining the alignment. In this work, we propose a new perspective that connects Align-Refine to the denoising autoencoder. We introduce a novel noisy distribution from which the alignment is sampled directly, instead of being obtained from the decoder output. Experimental results show that the proposed Align-Denoise speeds up both training and inference, with a relative performance improvement of up to 5% using single-pass decoding.

InterSpeech 2021

Related Recordings

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
(Oral presentation)

VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis
(Oral presentation)
