Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network
Yibo Wu (Tianjin University, China), Longbiao Wang (Tianjin University, China), Kong Aik Lee (A*STAR, Singapore), Meng Liu (Tianjin University, China), Jianwu Dang (Tianjin University, China)
Recently, increasing attention has been paid to the joint training of upstream and downstream tasks, and to the challenge of synchronizing multiple loss functions in a multi-objective scenario. In this paper, to address the competing gradient directions of the speaker classification loss and the feature enhancement loss, we propose an asynchronous subregion optimization approach for the joint training of feature enhancement and speaker embedding neural networks. As part of the asynchronous subregion optimization, the squeeze-and-excitation (SE) method is introduced in the enhancement network to adaptively select the channels that are important for speaker embedding. Furthermore, channel-wise feature concatenation is applied between the input feature and the enhanced feature to counter the distortion of speaker information caused by the enhancement loss. Using the proposed joint training network with asynchronous subregion optimization and channel-wise feature concatenation, we obtained relative improvements of 11.95% and 6.43% in equal error rate on a noisy version of VoxCeleb1 and on the VOiCES corpus, respectively.
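The two building blocks named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' network: it shows only the generic squeeze-and-excitation gate (global average pooling, a two-layer bottleneck, and a sigmoid that rescales each channel) and a plain channel-axis concatenation of the input and enhanced features. The function names `se_gate` and `concat_input_and_enhanced`, the `(channels, frequency, time)` layout, the reduction ratio `r`, and the random weights are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_gate(feat, w1, w2):
    """Generic squeeze-and-excitation over the channel axis.

    feat: feature map of shape (C, F, T)   (assumed layout)
    w1:   bottleneck weights of shape (C // r, C)
    w2:   expansion weights of shape (C, C // r)
    Returns the rescaled feature map and the per-channel scale.
    """
    squeezed = feat.mean(axis=(1, 2))          # squeeze: one value per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck with ReLU
    scale = sigmoid(w2 @ hidden)               # excitation: weights in (0, 1)
    return feat * scale[:, None, None], scale


def concat_input_and_enhanced(noisy, enhanced):
    """Stack the input feature and the enhanced feature along the channel axis."""
    return np.concatenate([noisy, enhanced], axis=0)


# Hypothetical sizes: 8 channels, 40 frequency bins, 100 frames, reduction ratio 4.
rng = np.random.default_rng(0)
C, F, T, r = 8, 40, 100, 4
feat = rng.standard_normal((C, F, T))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
gated, scale = se_gate(feat, w1, w2)

# Single-channel input and enhanced features stacked into a two-channel input
# for a downstream speaker embedding network.
noisy = rng.standard_normal((1, F, T))
enhanced = rng.standard_normal((1, F, T))
stacked = concat_input_and_enhanced(noisy, enhanced)
```

Keeping the unmodified input alongside the enhanced feature means that a downstream embedding network still has access to any speaker cues that enhancement may have distorted, which is the motivation the abstract gives for the concatenation.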