Adversarially Learning Disentangled Speech Representations for Robust Multi-factor Voice Conversion
Jie Wang (Tsinghua University, China), Jingbei Li (Tsinghua University, China), Xintao Zhao (Tsinghua University, China), Zhiyong Wu (Tsinghua University, China), Shiyin Kang (Huya, China), Helen Meng (Tsinghua University, China)
Factorizing speech into disentangled representations is vital for highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC factorize speech only into speaker and content, and so lack control over other prosody-related factors. State-of-the-art methods that cover more speech factors rely on basic disentanglement techniques such as random resampling and ad-hoc adjustment of bottleneck layer sizes, which makes robust disentanglement of the speech representations hard to ensure. To increase the robustness of highly controllable style transfer on multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted and then further disentangled by an adversarial Mask-And-Predict (MAP) network inspired by BERT. The adversarial network minimizes the correlations between the speech representations by randomly masking one of the representations and predicting it from the others. Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors, increasing the speech quality MOS from 2.79 to 3.30 and decreasing the MCD from 3.89 to 3.58.
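The mask-and-predict idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the network architecture or the loss functions, so the linear predictor, the mean squared error, the negated loss used as the encoder objective, and the names `DIM`, `predict_masked` and `map_losses` are all assumptions made for illustration. The sketch computes the two opposing objectives for one step and performs no parameter updates.

```python
import random

FACTORS = ("content", "timbre", "rhythm", "pitch")
DIM = 4  # hypothetical size of each representation vector


def predict_masked(reps, weights, masked):
    """Linearly predict the masked representation from the other three."""
    visible = [x for name in FACTORS if name != masked for x in reps[name]]
    return [sum(w * x for w, x in zip(row, visible)) for row in weights[masked]]


def map_losses(reps, weights, rng):
    """Mask one factor at random; return (masked, predictor_loss, encoder_loss)."""
    masked = rng.choice(FACTORS)
    prediction = predict_masked(reps, weights, masked)
    target = reps[masked]
    # The predictor would be trained to minimise this reconstruction error.
    predictor_loss = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
    # Assumed adversarial objective: the encoders are rewarded when the
    # masked factor cannot be predicted from the others.
    encoder_loss = -predictor_loss
    return masked, predictor_loss, encoder_loss


# Toy data standing in for the four extracted representations.
rng = random.Random(0)
reps = {name: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for name in FACTORS}
weights = {
    name: [[rng.gauss(0.0, 0.1) for _ in range(3 * DIM)] for _ in range(DIM)]
    for name in FACTORS
}
masked, predictor_loss, encoder_loss = map_losses(reps, weights, rng)
print(masked, round(predictor_loss, 4), round(encoder_loss, 4))
```

In a trainable version the predictor and the encoders would be updated in alternation, or jointly through a gradient reversal layer; which of these the paper uses is not stated in the abstract.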