Improving Embedding-based Neural-Network Speaker Recognition

Po-Chin Wang, Chia-Ping Chen, Chung-Li Lu, Bo-Cheng Chan, Shan-Wen Hsiao

In this paper, we integrate multiple ideas and techniques into an embedding-based neural-network speaker recognition (NSR) system. Such an NSR system essentially consists of a front-end speaker-embedding extractor and a back-end speaker-matching component. The front end is a neural network trained with millions of utterances from thousands of speakers. Currently, the back end is based on simple similarity measures such as angle, Euclidean distance, or probabilistic score. We begin with the well-known x-vector baseline and then incrementally modify the system modules. For the front-end extractor, we investigate modifications to the network architecture, network function, training criteria, and hyper-parameter settings. For the back-end matcher, we evaluate PLDA training/adaptation data and system fusion. On the public SRE 2018 Evaluation Dataset, the system's performance as measured by equal-error rate (EER) improves from 7.01% to 5.16%, a significant relative improvement of 26.5%.
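The back-end measures and the evaluation metric named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (cosine_score, euclidean_distance, equal_error_rate) are hypothetical, the embeddings are assumed to be plain NumPy vectors, and the EER is approximated by a simple threshold sweep over the observed scores, with higher scores assumed to mean "same speaker". PLDA scoring is not covered.

```python
import numpy as np

def cosine_score(a, b):
    """Angle-based similarity between two speaker embeddings."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Euclidean distance between two speaker embeddings (lower = more similar)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the operating point where the false-rejection
    rate and the false-acceptance rate are closest to each other."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection: a same-speaker trial scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance: a different-speaker trial scored at or above it.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(frr - far)))
    return float((frr[i] + far[i]) / 2.0)
```

For example, equal_error_rate applied to the cosine scores of all same-speaker trials and all different-speaker trials gives a single number comparable in kind to the EER figures quoted above, although a threshold sweep on a small trial list is only a coarse estimate.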

Odyssey 2020

The Speaker and Language Recognition Workshop
