Yufei Liu (Tencent, China), Chengzhu Yu (Tencent, China), Wang Shuai (Tencent, China), Zhenchuan Yang (Tencent, China), Yang Chao (Tencent, China), Weibin Zhang (SCUT, China)
This paper proposes a non-parallel any-to-many voice conversion (VC) approach with a novel statistics replacement layer. Non-parallel VC is usually achieved by first disentangling linguistic and speaker representations, and then concatenating the linguistic content with the learned target speaker’s embedding at the conversion stage. Although such a concatenation-based approach can introduce speaker-specific characteristics into the network, it is not very effective because it relies entirely on the network to learn how to combine the linguistic content with the speaker characteristics. Inspired by X-vectors, where statistics of the hidden representation, such as means and standard deviations, are used to differentiate speakers, we propose a statistics replacement layer for VC systems that directly modifies the hidden states so that they have the target speaker’s statistics. The speaker-specific statistics of the hidden states are learned for each target speaker during training and are used to guide the statistics replacement layer during inference. Moreover, to better concentrate the speaker information in the statistics of the hidden representation, multitask training with X-vector-based speaker classification is also performed. Experimental results on the LibriSpeech and VCTK datasets show that the proposed method effectively improves the naturalness and speaker similarity of the converted speech.
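To make the idea of statistics replacement concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that the statistics are the per-channel mean and standard deviation computed over the time axis, and that replacement means normalizing the hidden states and rescaling them with the target speaker's values. The abstract does not specify these details; the function name `replace_statistics`, the tensor layout, and the numeric values are all hypothetical.

```python
import numpy as np

def replace_statistics(hidden, target_mean, target_std, eps=1e-5):
    """Give hidden states a target speaker's per-channel statistics.

    hidden:      array of shape (channels, frames), assumed layout.
    target_mean: array of shape (channels,), assumed learned per speaker.
    target_std:  array of shape (channels,), assumed learned per speaker.
    """
    # Per-channel statistics of the source utterance over the time axis.
    mean = hidden.mean(axis=1, keepdims=True)
    std = hidden.std(axis=1, keepdims=True)
    # Remove the source statistics, then impose the target speaker's.
    normalized = (hidden - mean) / (std + eps)
    return normalized * target_std[:, None] + target_mean[:, None]

# Toy usage: 4 hidden channels, 200 frames of hypothetical hidden states.
rng = np.random.default_rng(0)
hidden = rng.normal(loc=2.0, scale=3.0, size=(4, 200))
target_mean = np.array([0.0, 1.0, -1.0, 0.5])
target_std = np.array([1.0, 0.5, 2.0, 1.5])
converted = replace_statistics(hidden, target_mean, target_std)
```

Under these assumptions, the converted hidden states keep the frame-by-frame shape of the input, while their per-channel mean and standard deviation match the target speaker's values. This contrasts with concatenation, where the network must learn on its own how to combine the speaker embedding with the content.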