InterSpeech 2021

Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptation
(3-minute introduction)

Scott Novotney (Amazon, USA), Yile Gu (Amazon, USA), Ivan Bulyko (Amazon, USA)
To improve customer privacy, commercial speech applications are reducing human transcription of customer data. This hurts language model training, which is left with fewer in-domain transcripts. Prior work demonstrated that training on automated transcripts alone provides only modest gains because recognition errors are reinforced. We consider a new condition in which a model trained on historical human transcripts, but not the transcripts themselves, is available to us. To overcome temporal drift in vocabulary and topics, we propose adjunct-emeritus distillation, a novel extension of knowledge distillation in which two imperfect teachers jointly train a student model. We conduct experiments on an English voice assistant domain and simulate a one-year gap in human transcription. Unlike fine-tuning, our approach is architecture agnostic; it achieves a 14% relative reduction in perplexity over the baseline approach of freezing model development, and it also improves over the knowledge-distillation baseline.
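
The abstract does not spell out the training objective, but the phrase "two imperfect teachers jointly train a student model" suggests a two-teacher distillation loss. Below is a minimal sketch of one such loss, assuming a PyTorch setup in which the student matches an interpolation of the two teachers' softened output distributions; the function name, the interpolation weight alpha, and the temperature are illustrative assumptions, not the authors' formulation.

    import torch
    import torch.nn.functional as F

    def two_teacher_distillation_loss(student_logits,
                                      teacher_a_logits,
                                      teacher_b_logits,
                                      alpha=0.5,
                                      temperature=2.0):
        """Generic two-teacher distillation sketch (not the paper's exact
        objective): the student matches an interpolation of the two
        teachers' temperature-softened output distributions."""
        t = temperature
        p_a = F.softmax(teacher_a_logits / t, dim=-1)
        p_b = F.softmax(teacher_b_logits / t, dim=-1)
        # Interpolate the two (imperfect) teacher distributions.
        p_teacher = alpha * p_a + (1.0 - alpha) * p_b
        log_q_student = F.log_softmax(student_logits / t, dim=-1)
        # KL(teacher || student), scaled by t^2 as in standard distillation.
        return F.kl_div(log_q_student, p_teacher,
                        reduction="batchmean") * (t * t)

    # Hypothetical usage: a batch of 8 positions over a 10k-word vocabulary.
    vocab = 10000
    student_logits = torch.randn(8, vocab)
    emeritus_logits = torch.randn(8, vocab)  # assumed: the historical (emeritus) teacher
    adjunct_logits = torch.randn(8, vocab)   # assumed: a second, in-domain (adjunct) teacher
    loss = two_teacher_distillation_loss(student_logits, emeritus_logits, adjunct_logits)

Because the loss depends only on output distributions, not on the internals of either teacher, a formulation along these lines would be architecture agnostic, consistent with the claim in the abstract.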