InterSpeech 2021

SRI-B End-to-End System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
(3-minute introduction)

Hardik Sailor (Samsung, India), Kiran Praveen T. (Samsung, India), Vikas Agrawal (Samsung, India), Abhinav Jain (Samsung, India), Abhishek Pandey (Samsung, India)
This paper describes SRI-B’s end-to-end Automatic Speech Recognition (ASR) system proposed for subtask-1 of the multilingual ASR challenge for Indian languages. Our end-to-end (E2E) ASR model is based on the transformer architecture, trained by jointly minimizing Connectionist Temporal Classification (CTC) and Cross-Entropy (CE) losses. A conventional multilingual model, trained by pooling data from multiple languages, generalizes better, but at the expense of performance degradation compared to its monolingual counterparts. In our experiments, a multilingual model is trained by conditioning the input features on a language-specific embedding vector. These language-specific embedding vectors are obtained by training a language classifier using an attention-based transformer architecture and taking its bottleneck features as language identification (LID) embeddings. We further adapt the multilingual system with language-specific data to reduce the degradation on specific languages. We propose a novel hypothesis elimination strategy based on LID scores and length-normalized probabilities that optimally selects a model from the pool of available models. The experimental results show that the proposed multilingual training and hypothesis elimination strategy gives an average relative word error rate (WER) improvement of 3.02% on the blind set over the challenge hybrid ASR baseline system.
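To make the two core ideas in the abstract concrete, here is a minimal PyTorch sketch of (a) conditioning the encoder input features on a language-specific (LID) embedding and (b) the weighted joint CTC + CE objective. This is an illustrative assumption of how such a system could be wired up, not the authors' actual implementation; all names and hyperparameters (MultilingualEncoder, joint_ctc_ce_loss, ctc_weight, feature and embedding dimensions) are hypothetical, and the attention decoder that would normally produce the CE logits is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultilingualEncoder(nn.Module):
    """Transformer encoder whose inputs are conditioned on an LID embedding.

    The embedding table stands in for the LID bottleneck features described
    in the abstract; in the paper these come from a separately trained
    language classifier.
    """

    def __init__(self, feat_dim=80, lid_dim=64, model_dim=256,
                 vocab_size=5000, num_languages=6):
        super().__init__()
        self.lid_embedding = nn.Embedding(num_languages, lid_dim)
        # Project [acoustic features ; LID embedding] to the model dimension.
        self.input_proj = nn.Linear(feat_dim + lid_dim, model_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True),
            num_layers=6)
        self.ctc_head = nn.Linear(model_dim, vocab_size)

    def forward(self, feats, lang_ids):
        # feats: (B, T, feat_dim); lang_ids: (B,)
        lid = self.lid_embedding(lang_ids)                       # (B, lid_dim)
        lid = lid.unsqueeze(1).expand(-1, feats.size(1), -1)     # (B, T, lid_dim)
        x = self.input_proj(torch.cat([feats, lid], dim=-1))     # LID-conditioned input
        enc = self.encoder(x)                                    # (B, T, model_dim)
        return enc, self.ctc_head(enc)                           # encoder states, CTC logits


def joint_ctc_ce_loss(ctc_logits, dec_logits, targets, feat_lens, target_lens,
                      ctc_weight=0.3, blank=0, pad_id=-100):
    """Hybrid objective L = w * L_CTC + (1 - w) * L_CE.

    dec_logits (B, S, V) are assumed to come from an attention decoder run
    with teacher forcing (not shown); targets (B, S) are padded with pad_id,
    and CTC only reads the first target_lens[b] labels of each sequence.
    """
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)       # (T, B, V) for ctc_loss
    ctc = F.ctc_loss(log_probs, targets, feat_lens, target_lens,
                     blank=blank, zero_infinity=True)
    ce = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                         targets.reshape(-1), ignore_index=pad_id)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce
```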