Learning Mixture Representation for Deep Speaker Embedding Using Attention

Weiwei Lin, Man Wai Mak, Lu Yi

Almost all speaker recognition systems involve a step that converts a sequence of frame-level features into a fixed-dimensional representation. In the context of deep neural networks, this step is referred to as statistics pooling. In state-of-the-art speaker recognition systems, statistics pooling is implemented by concatenating the mean and standard deviation of a sequence of frame-level features. However, a single mean and standard deviation are very limited descriptive statistics for an acoustic sequence, even with a powerful feature extractor like a convolutional neural network. In this paper, we propose a novel statistics pooling method that can produce more descriptive statistics through a mixture representation. Our method is inspired by the expectation-maximization (EM) algorithm in Gaussian mixture models (GMMs). However, unlike in GMMs, the mixture assignments are given by an attention mechanism instead of the Euclidean distances between frame-level features and explicit centers. Applying the proposed attention mechanism to a 121-layer DenseNet, we achieve an EER of 1.1% on VoxCeleb1 and an EER of 4.77% on the VOiCES 2019 evaluation set.
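To make the contrast in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It compares standard statistics pooling (one mean and one standard deviation) with a mixture-style pooling in which soft assignments come from an attention-like projection. The function names, the linear projection `attn_weights`, the softmax over components, and the number of components are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def statistics_pooling(frames):
    """Baseline pooling: (T, D) frame features -> (2*D,) mean and std."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mixture_pooling(frames, attn_weights, eps=1e-8):
    """Illustrative mixture pooling: (T, D) frames -> (2*K*D,) vector.

    attn_weights is a hypothetical (D, K) projection; in a real system
    it would be learned jointly with the network.
    """
    scores = frames @ attn_weights                  # (T, K) attention scores
    resp = softmax(scores, axis=1)                  # soft assignment of each frame to K components
    occupancy = resp.sum(axis=0) + eps              # (K,) soft frame count per component
    means = (resp.T @ frames) / occupancy[:, None]  # (K, D) weighted means
    second = (resp.T @ frames**2) / occupancy[:, None]
    # Clamp at zero to guard against tiny negative values from rounding.
    stds = np.sqrt(np.maximum(second - means**2, 0.0))
    return np.concatenate([means.ravel(), stds.ravel()])


rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 16))  # T=200 frames, D=16 features (toy sizes)
W = rng.standard_normal((16, 4))         # K=4 components (assumed)

baseline = statistics_pooling(frames)
mixture = mixture_pooling(frames, W)
print(baseline.shape, mixture.shape)     # (32,) (128,)
```

With a single component, every frame receives an assignment of 1, so this sketch reduces to ordinary mean and standard deviation pooling. Using several components yields one mean and one standard deviation per component, which is the sense in which a mixture representation is more descriptive.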

Odyssey 2020

The Speaker and Language Recognition Workshop
