InterSpeech 2021

Improving Time Delay Neural Network Based Speaker Recognition With Convolutional Block And Feature Aggregation Methods

Yu-Jia Zhang (National Sun Yat-sen University, Taiwan), Yih-Wen Wang (National Sun Yat-sen University, Taiwan), Chia-Ping Chen (National Sun Yat-sen University, Taiwan), Chung-Li Lu (Chunghwa Telecom Laboratories, Taiwan), Bo-Cheng Chan (Chunghwa Telecom Laboratories, Taiwan)
In this paper, we develop a speaker recognition system that integrates ideas and techniques inspired by convolutional blocks and feature aggregation methods. We begin with the state-of-the-art speaker-embedding model for speaker recognition, namely Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN), and then incrementally experiment with the proposed network modules, including bottleneck residual blocks, attention mechanisms, and feature aggregation methods. In our final model, we replace the Res2Block with the SC-Block and use a hierarchical architecture for feature aggregation. We evaluate our model on the VoxCeleb1 test set and the 2020 VoxCeleb Speaker Recognition Challenge (VoxSRC20) validation set. The relative improvement of the proposed models over ECAPA-TDNN is 22.8% on VoxCeleb1 and 18.2% on VoxSRC20.
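
For readers unfamiliar with how relative improvement figures such as those quoted above are usually computed, the following is a minimal sketch. It assumes the underlying metric is an error measure where lower is better; the abstract does not name the metric, and the function name and the input values below are hypothetical and not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative reduction of an error metric, returned as a percentage.

    Assumption: lower values are better (the abstract does not state
    which metric is used).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical error values for illustration only, not results from the paper.
example = relative_improvement(baseline=2.0, proposed=1.5)
print(f"{example:.1f}%")
```

Under this definition, a relative improvement of 22.8% means the proposed model's error is 77.2% of the baseline's error on the same evaluation set.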