Mitchell Mclaren, Md Hafizur Rahman, Diego Castan, Mahesh Kumar Nandwana, Aaron Lawson
We propose an active learning approach for the unsupervised normalization of vector representations of speech, such as the speaker embeddings now in widespread use in speaker recognition systems. We demonstrate that the mean traditionally used to normalize speaker embeddings prior to probabilistic linear discriminant analysis (PLDA) is suboptimal when the evaluation conditions do not match the training conditions. Using an unlabeled sample of target-domain data, we show that the proposed adaptive mean normalization (AMN) technique improves discrimination and calibration performance by up to 26% and 65% relative, respectively, over out-of-the-box system performance. These benchmarks were performed on four distinctly different datasets for a thorough analysis of AMN robustness. Most notably, for a range of data conditions, AMN enabled the use of a calibration model trained on data mismatched to the conditions being evaluated. The approach was found to be effective when using as few as thirty-two unlabeled samples of target-domain data.
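The core idea of centering embeddings with a mean drawn from the target domain, rather than the training mean, can be illustrated with a minimal sketch. This is not the AMN procedure itself, whose details are not given in this abstract: the code assumes plain mean replacement, uses synthetic Gaussian vectors in place of real speaker embeddings, and the names `estimate_domain_mean` and `center` are hypothetical.

```python
import numpy as np


def estimate_domain_mean(target_embeddings):
    """Estimate a centering vector from unlabeled target-domain embeddings.

    Assumption: a plain arithmetic mean; the paper's AMN may differ.
    """
    x = np.asarray(target_embeddings, dtype=float)
    if x.ndim != 2 or x.shape[0] == 0:
        raise ValueError("expected a non-empty (n_samples, dim) array")
    return x.mean(axis=0)


def center(embeddings, mean):
    """Subtract the chosen mean from each embedding (done before PLDA scoring)."""
    return np.asarray(embeddings, dtype=float) - mean


# Synthetic stand-ins for embeddings: the target domain is offset from training.
rng = np.random.default_rng(0)
dim = 8
domain_shift = np.full(dim, 3.0)  # hypothetical mismatch between conditions

train_embeddings = rng.normal(0.0, 1.0, size=(500, dim))
target_unlabeled = rng.normal(0.0, 1.0, size=(32, dim)) + domain_shift
eval_embeddings = rng.normal(0.0, 1.0, size=(200, dim)) + domain_shift

train_mean = train_embeddings.mean(axis=0)
adapted_mean = estimate_domain_mean(target_unlabeled)

# Residual offset of the evaluation data after centering with each mean.
residual_train = np.linalg.norm(center(eval_embeddings, train_mean).mean(axis=0))
residual_adapted = np.linalg.norm(center(eval_embeddings, adapted_mean).mean(axis=0))

print(f"residual offset with training mean: {residual_train:.3f}")
print(f"residual offset with adapted mean:  {residual_adapted:.3f}")
```

In this toy setting, thirty-two unlabeled target vectors are enough to remove most of the offset that the training mean leaves behind. Whether that carries over to real embeddings and to PLDA scores is what the benchmarks described above evaluate.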