Combined Vector Based on Factorized Time-delay Neural Network for Text-Independent Speaker Recognition

Tianyu Liang, Yi Liu, Can Xu, Xianwei Zhang, Liang He

Currently, the most effective approach to text-independent speaker recognition is to extract speaker embeddings from deep neural networks. Among these, the x-vector extracted from a factorized time-delay neural network (F-TDNN) has been shown to be among the best-performing systems in recent NIST SRE evaluations. In our previous work, we proposed the combined vector (c-vector) and showed that performance can be further improved by introducing phonetic information, which is often ignored when extracting x-vectors. Taking advantage of both the F-TDNN and the c-vector, we propose an embedding extraction method termed the factorized combined vector (fc-vector). On the NIST SRE18 CTS task, the EER and minDCF18 of the fc-vector are 12.1% and 10.5% lower, in relative terms, than those of the x-vector, and 3.4% and 3.9% lower than those of the c-vector, respectively.

Odyssey 2020

The Speaker and Language Recognition Workshop
