Interspeech 2021

Leveraging non-target language resources to improve ASR performance in a target language

Jayadev Billa (University of Southern California, USA)
This paper investigates approaches to improving automatic speech recognition (ASR) performance in a target language using resources in other languages. In particular, we assume that we have untranscribed speech in a different language and a well-trained ASR system in yet another language. Concretely, we structure this as a multi-task problem, where the primary task is acoustic model training in the target language, and the secondary task is also acoustic model training but on a synthetic data set. The synthetic data set consists of pseudo transcripts generated by decoding the untranscribed speech with the well-trained ASR model. We compare and contrast this with using labeled data sets, i.e., matched audio and human-generated transcripts, and show that our approach compares favorably. In most cases we see performance improvements, and in some cases, depending on the selection of languages and the nature of the speech data, performance exceeds that of systems using labeled data sets for the secondary task. When extended to larger data sets, we show that the mismatched-data approach performs similarly to in-language semi-supervised training (SST) when the secondary-task pseudo transcripts are generated by ASR models trained on large, diverse data sets.
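The abstract leaves the model architecture and training objective unspecified. The sketch below illustrates one plausible reading of the multi-task setup, assuming a shared acoustic encoder with two CTC output heads: a primary head trained on target-language transcripts and a secondary head trained on the pseudo-transcribed synthetic data set. All names here (MultiTaskAcousticModel, multitask_loss, the weight alpha) and the choice of CTC are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    """Shared encoder with two CTC heads: a primary head for the target
    language and a secondary head for the pseudo-transcribed data.
    (Hypothetical sketch; the paper does not specify this architecture.)"""

    def __init__(self, feat_dim, hidden_dim, target_vocab, secondary_vocab):
        super().__init__()
        # A small BLSTM stands in for whatever encoder the paper uses.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.primary_head = nn.Linear(2 * hidden_dim, target_vocab)
        self.secondary_head = nn.Linear(2 * hidden_dim, secondary_vocab)

    def forward(self, feats):
        enc, _ = self.encoder(feats)          # (N, T, 2 * hidden_dim)
        return self.primary_head(enc), self.secondary_head(enc)


def multitask_loss(model, primary_batch, secondary_batch, ctc, alpha=0.5):
    """Weighted sum of the primary CTC loss (target-language transcripts)
    and the secondary CTC loss (pseudo transcripts obtained by decoding
    untranscribed speech with a well-trained ASR model)."""
    p_logits, _ = model(primary_batch["feats"])
    _, s_logits = model(secondary_batch["feats"])
    # nn.CTCLoss expects (T, N, C) log-probabilities.
    p_loss = ctc(p_logits.log_softmax(-1).transpose(0, 1),
                 primary_batch["labels"],
                 primary_batch["input_lens"], primary_batch["label_lens"])
    s_loss = ctc(s_logits.log_softmax(-1).transpose(0, 1),
                 secondary_batch["labels"],
                 secondary_batch["input_lens"], secondary_batch["label_lens"])
    return p_loss + alpha * s_loss


# Usage sketch: ctc = nn.CTCLoss(blank=0); loss = multitask_loss(...)
```

In this reading, the secondary-task weight alpha would be tuned on held-out target-language data, and the secondary head can be discarded at decode time since only the primary, target-language task is evaluated.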