Interspeech 2021

(longer introduction)

Zhiming Wang (Ant, China), Furong Xu (Ant, China), Kaisheng Yao (Ant, China), Yuan Cheng (Ant, China), Tao Xiong (Ant, China), Huijia Zhu (Ant, China)
This paper presents a comprehensive description of the AntVoice system for the first two tracks of FFSVC 2020 [1], which address far-field speaker verification from a single microphone array. The system is based on neural speaker embeddings extracted by deep neural network encoders. The encoders used for acoustic modeling include 2D convolutional residual-like networks, which prove effective on these tasks. Specifically, we apply the Squeeze-and-Excitation residual network (SE-ResNet) [2] to model cross-channel interdependencies. On short utterances, we observe that SE-ResNet outperforms alternative methods in the text-dependent verification task. The system adopts a joint loss function that combines the additive cosine margin softmax loss [3] with the equidistant triplet-based loss [4]. This joint loss yields performance gains by producing more discriminative speaker embeddings, with higher intra-class similarity and larger inter-class variance. We also apply speech enhancement and data augmentation to improve data quality and diversity. Even without model ensembles, the proposed system significantly outperforms the baselines [1] in both tracks of the challenge. Fusing several encoder networks consistently yields further improvements. Overall, the AntVoice system achieved third place in the text-dependent verification task.
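
As an illustration of the joint loss described above, the following is a minimal NumPy sketch, not the authors' implementation. It computes an additive cosine margin softmax term and adds a triplet term to it. Several things here are assumptions: the function names, the `scale`, `margin`, and `triplet_weight` values, and the triplet term itself, which is a standard cosine-distance triplet hinge used as a stand-in because the abstract does not specify the equidistant triplet-based loss of [4].

```python
import numpy as np


def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.2):
    """Additive cosine margin softmax: subtract `margin` from the target-class cosine.

    embeddings: (batch, dim), class_weights: (dim, num_speakers), labels: (batch,) ints.
    `scale` and `margin` are illustrative values, not taken from the paper.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = emb @ w                                   # cosine similarities, (batch, num_speakers)
    one_hot = np.eye(w.shape[1])[labels]            # marks the target class per row
    logits = scale * (cos - margin * one_hot)       # margin applied to the target class only
    logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard cosine-distance triplet hinge (a stand-in, NOT the equidistant variant of [4])."""
    def cos_dist(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return 1.0 - np.sum(a * b, axis=1)

    hinge = np.maximum(0.0, cos_dist(anchor, positive) - cos_dist(anchor, negative) + margin)
    return float(hinge.mean())


def joint_loss(embeddings, class_weights, labels, anchor, positive, negative,
               triplet_weight=1.0):
    """Weighted sum of the two terms; `triplet_weight` is a hypothetical hyperparameter."""
    return (am_softmax_loss(embeddings, class_weights, labels)
            + triplet_weight * triplet_loss(anchor, positive, negative))
```

In this sketch the margin makes the classification term harder to satisfy, which pushes same-speaker embeddings closer to their class direction, while the triplet term penalizes triplets in which the negative is not at least `margin` farther from the anchor than the positive.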