Unsupervised Regularization of the Embedding Extractor for Robust Language Identification

Raphaël Duroselle, Denis Jouvet, Irina Illina

State-of-the-art spoken language identification systems consist of three modules: a frame-level feature extractor, a segment-level embedding extractor, and a final classifier. The performance of these systems degrades when there is a mismatch between training and test data. Most domain adaptation methods focus on adapting the final classifier. In this article, we propose a model-based unsupervised domain adaptation of the segment-level embedding extractor. The approach modifies the loss function used to train the embedding extractor by adding a regularization term based on the maximum mean discrepancy loss. Experiments were performed on the RATS corpus with transmission channel mismatch between telephone and radio channels. We obtained the same language identification performance as supervised training on the target domains, but without using labeled data from these domains.
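
To make the idea of a maximum mean discrepancy (MMD) regularizer concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the kernel, its bandwidth, the estimator, or how the term is weighted, so the Gaussian kernel, the biased estimator, and the names `rbf_kernel`, `mmd_squared`, `regularized_loss`, `bandwidth`, and `weight` are all hypothetical choices.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a and rows of b (assumed kernel)."""
    # Pairwise squared Euclidean distances via the expansion |x|^2 + |y|^2 - 2 x.y
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two batches of embeddings."""
    k_ss = rbf_kernel(source, source, bandwidth).mean()
    k_tt = rbf_kernel(target, target, bandwidth).mean()
    k_st = rbf_kernel(source, target, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st


def regularized_loss(classification_loss, source_emb, target_emb,
                     weight=1.0, bandwidth=1.0):
    """Supervised loss on labeled source data plus an unsupervised MMD penalty.

    The target embeddings need no labels: the penalty only compares the two
    embedding distributions. `weight` is a hypothetical trade-off parameter.
    """
    return classification_loss + weight * mmd_squared(
        source_emb, target_emb, bandwidth
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    telephone = rng.normal(size=(50, 8))        # stand-in source-domain embeddings
    radio = rng.normal(size=(40, 8)) + 3.0      # stand-in shifted target-domain embeddings
    print("MMD^2, same batch:   ", mmd_squared(telephone, telephone))
    print("MMD^2, shifted batch:", mmd_squared(telephone, radio))
```

The penalty is close to zero when the two batches of embeddings follow the same distribution and grows as they drift apart, so minimizing it alongside the classification loss encourages the extractor to produce embeddings that look alike across channels. In an actual training setup the same computation would be written in an automatic differentiation framework so that gradients reach the embedding extractor.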

Odyssey 2020

The Speaker and Language Recognition Workshop

