Joint Training End-to-End Speech Recognition Systems with Speaker Attributes

Sheng Li, Xugang Lu, Raj Dabre, Peng Shen, Hisashi Kawai

End-to-end (E2E) models simplify conventional automatic speech recognition (ASR) systems by integrating the acoustic model, lexicon, and language model into a single neural network. In this paper, we focus on improving the performance of the state-of-the-art transformer-based E2E ASR system (ASR-Transformer). We propose jointly training the compressed ASR-Transformer with speaker recognition (SR) tasks. A common practice is to use speaker IDs for joint training of the ASR and SR tasks; however, this leads to no significant improvement. To address this problem, we propose augmenting the labels with bags of speaker attributes instead of simple speaker IDs. Experiments show that the proposed method effectively improves the performance of the compressed ASR-Transformer on the CSJ corpus. Moreover, the proposed bags-of-attributes method has the potential to be used for building highly customized ASR systems.
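To make the contrast between speaker-ID labels and bags-of-attributes labels concrete, the following is a minimal sketch of one way the target labels could be augmented. The abstract does not say how the augmentation is carried out, so everything here is an assumption made for illustration: the choice to prepend tokens to the transcript, the `<name:value>` token format, the function names, and the attribute names (`gender`, `age`) are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def augment_with_speaker_id(tokens: List[str], speaker_id: str) -> List[str]:
    """Baseline (hypothetical format): prepend a single speaker-ID token."""
    return [f"<spk:{speaker_id}>"] + tokens


def augment_with_attributes(tokens: List[str], attributes: Dict[str, str]) -> List[str]:
    """Bag-of-attributes variant (hypothetical format).

    Each attribute becomes its own token, so speakers who share an
    attribute value also share a label token, unlike unique speaker IDs.
    Attributes are sorted by name to keep the token order deterministic.
    """
    attr_tokens = [f"<{name}:{value}>" for name, value in sorted(attributes.items())]
    return attr_tokens + tokens


if __name__ == "__main__":
    transcript = ["kyou", "wa", "hare"]  # placeholder transcript tokens
    print(augment_with_speaker_id(transcript, "S001"))
    print(augment_with_attributes(transcript, {"gender": "F", "age": "30s"}))
```

Under these assumptions, a speaker-ID token is unique to one speaker, whereas attribute tokens recur across many speakers, which is one plausible reason a bag of attributes could give the ASR model a more useful training signal.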

Odyssey 2020

The Speaker and Language Recognition Workshop
