|Po-Chin Wang, Chia-Ping Chen, Chung-Li Lu, Bo-Cheng Chan, Shan-Wen Hsiao|
In this paper, we integrate multiple ideas and techniques into an embedding-based neural-network speaker recognition (NSR) system. Such an NSR system essentially consists of a front-end speaker-embedding extractor and a back-end speaker-matching component. The front-end is a neural network trained on millions of utterances from thousands of speakers, while the back-end is currently based on simple similarity measures such as angle, Euclidean distance, or a probabilistic score. We begin with the well-known x-vector baseline and then incrementally modify the system modules. For the front-end extractor, we investigate modifications to the network architecture, network functions, training criteria, and hyper-parameter settings. For the back-end matcher, we evaluate PLDA training/adaptation data and system fusion. On the public SRE 2018 Evaluation Dataset, the system performance as measured by equal-error rate (EER) is improved from 7.01% to 5.16%, a significant relative improvement of 26.4%.
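As a minimal illustration of the back-end matching step described above, the sketch below scores a pair of fixed-dimensional speaker embeddings by cosine similarity, one of the simple measures mentioned. The toy 4-dimensional embeddings and the decision threshold are illustrative assumptions, not values from the paper (real x-vectors are typically several hundred dimensions, and PLDA scoring would replace the raw cosine score in a full system).

```python
import math

def cosine_score(u, v):
    """Cosine similarity between two speaker embeddings (e.g. x-vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def same_speaker(enroll, test, threshold=0.5):
    """Accept the trial if the score exceeds a tuned threshold.

    The threshold is illustrative; in practice it is chosen to trade off
    false accepts against false rejects (the EER operating point is where
    the two error rates are equal).
    """
    return cosine_score(enroll, test) > threshold

# Hypothetical toy embeddings for two trials.
emb_enroll = [0.9, 0.1, 0.0, 0.2]
emb_same = [0.8, 0.2, 0.1, 0.3]    # close in angle: likely same speaker
emb_diff = [-0.5, 0.9, 0.3, -0.4]  # far in angle: likely different speaker

print(round(cosine_score(emb_enroll, emb_same), 3))
print(same_speaker(emb_enroll, emb_same), same_speaker(emb_enroll, emb_diff))
```

The same pairwise-scoring interface applies whether the score is a cosine, a negative Euclidean distance, or a PLDA log-likelihood ratio, which is what makes back-end components easy to swap and fuse.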