InterSpeech 2021

EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder
(3-minute introduction)

Zhengchen Liu (Ping An Technology, China), Chenfeng Miao (Ping An Technology, China), Qingying Zhu (Ping An Technology, China), Minchuan Chen (Ping An Technology, China), Jun Ma (Ping An Technology, China), Shaojun Wang (Ping An Technology, China), Jing Xiao (Ping An Technology, China)
In this paper, we present EfficientSing, a Chinese singing voice synthesis (SVS) system based on a non-autoregressive, duration-free acoustic model and a HiFi-GAN neural vocoder. Unlike many existing SVS methods, this work needs no auxiliary duration prediction module, because it adopts a newly proposed monotonic alignment modeling mechanism. Moreover, we follow the non-autoregressive architecture of EfficientTTS with some singing-specific adaptations, making both training and inference fully parallel and efficient. The HiFi-GAN vocoder is adopted to improve both the voice quality of the synthesized songs and the inference efficiency. Both objective and subjective experimental results show that the proposed system produces quite natural, high-fidelity songs and outperforms the Tacotron-based baseline in terms of pronunciation, pitch, and rhythm.