Information Preservation Pooling for Speaker Embedding

Min Hyun Han, Woo Hyun Kang, Sung Hwan Mun, Nam Soo Kim

Many recent studies on speaker embedding have focused on the pooling technique. In speaker recognition, pooling plays the important role of summarizing variable-length inputs into a fixed-dimensional output. One of the most popular pooling methods for text-independent speaker verification systems is attention-based pooling, which uses an attention mechanism to assign a different weight to each frame. Utterance-level features are then generated by computing the weighted means and standard deviations of the frame-level features. However, useful information in the frame-level features can be compromised during the pooling step. In this paper, we propose an information preservation pooling method that exploits a mutual information neural estimator to preserve local information in frame-level features during the pooling step. Our evaluation on the VoxCeleb datasets shows that the proposed method reduces the equal error rate by 14.6% compared with the conventional method.
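
As a rough illustration of the attention-based pooling baseline described in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. The function name `attentive_stats_pooling`, the single-hidden-layer tanh scoring network, and the parameters `w`, `b`, and `v` are assumptions made for this example; the abstract only states that frames receive attention weights and that weighted means and standard deviations are computed. The mutual-information term of the proposed method is not shown.

```python
import numpy as np

def attentive_stats_pooling(frames, w, b, v):
    """Pool (T, D) frame-level features into a fixed 2*D utterance-level vector.

    frames: (T, D) frame-level features, T may vary per utterance
    w, b, v: hypothetical attention parameters with shapes (D, H), (H,), (H,)
    """
    # One scalar score per frame from a small scoring network (assumed form).
    scores = np.tanh(frames @ w + b) @ v            # (T,)
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over frames, sums to 1

    # Attention-weighted mean and standard deviation over the time axis.
    mean = alpha @ frames                           # (D,)
    var = alpha @ (frames ** 2) - mean ** 2         # (D,)
    std = np.sqrt(np.clip(var, 1e-8, None))         # guard against tiny negatives
    return np.concatenate([mean, std])              # (2 * D,)

# Utterances of different lengths map to vectors of the same size.
rng = np.random.default_rng(0)
D, H = 4, 3
w, b, v = rng.normal(size=(D, H)), np.zeros(H), rng.normal(size=H)
short_utt = attentive_stats_pooling(rng.normal(size=(50, D)), w, b, v)
long_utt = attentive_stats_pooling(rng.normal(size=(200, D)), w, b, v)
print(short_utt.shape, long_utt.shape)  # (8,) (8,)
```

Whatever the number of frames, the output has dimension 2 * D, which is the "fixed-dimensional output" the abstract refers to. Because the frames are collapsed into two weighted statistics, frame-level detail can be lost, which is the issue the proposed method targets.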

Odyssey 2020

The Speaker and Language Recognition Workshop
