|Yingke Zhu, Brian Mak|
This paper explores orthogonal training in end-to-end speaker verification (SV) tasks. In various end-to-end speaker verification systems, cosine similarity has been used as the distance measure between speaker embeddings. However, the effectiveness of cosine similarity rests on the assumption that the dimensions of the speaker embeddings are orthogonal. In our previous work on orthogonal training, we showed that in SV systems with a cosine-similarity backend, introducing orthogonality on the weights of speaker-discriminative deep neural networks can significantly improve system performance. In this paper, we introduce two orthogonality regularizers to end-to-end speaker verification systems. The first is based on the Frobenius norm, and the second utilizes the restricted isometry property. Both regularization methods can be handily incorporated into end-to-end training. We build systems based on state-of-the-art end-to-end models. Two network architectures, LSTM and TDNN, are used in order to investigate the effects of orthogonality regularization on different types of models. Systems are assessed on the VoxCeleb corpus, and significant gains are obtained with our new regularized orthogonal training.
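To illustrate the two families of regularizers the abstract names, the sketch below shows one common formulation of each: a soft-orthogonality penalty based on the squared Frobenius norm of W^T W − I, and a spectral-norm penalty of the same matrix in the spirit of restricted-isometry-based (SRIP-style) regularization. This is a minimal NumPy sketch under assumed formulations, not necessarily the exact losses used in the paper; the function names are hypothetical.

```python
import numpy as np

def frobenius_orthogonality(W):
    """Soft orthogonality penalty ||W^T W - I||_F^2 (assumed form).

    Penalizes deviation of the Gram matrix of the weight columns
    from the identity; zero iff the columns are orthonormal.
    """
    gram = W.T @ W                       # (d, d) Gram matrix of columns
    eye = np.eye(W.shape[1])
    return float(np.sum((gram - eye) ** 2))

def srip_orthogonality(W):
    """SRIP-style penalty: spectral norm of (W^T W - I) (assumed form).

    The restricted isometry property bounds how much W can stretch or
    shrink vectors; the largest singular value of W^T W - I measures
    the worst-case deviation from an isometry. In training, this is
    typically estimated cheaply with a few power iterations; here we
    compute it exactly via SVD for clarity.
    """
    A = W.T @ W - np.eye(W.shape[1])
    return float(np.linalg.norm(A, ord=2))
```

During training, either penalty would be scaled by a small coefficient and added to the main speaker-classification loss; for a column-orthonormal matrix both penalties vanish.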