Personalized Singing Voice Generation Using WaveRNN

Xiaoxue Gao, Xiaohai Tian, Yi Zhou, Rohan Kumar Das, Haizhou Li

In this paper, we formulate a personalized singing voice generation (SVG) framework using WaveRNN with non-parallel training data. We develop an average singing voice generation model using WaveRNN from multi-singer vocals. The model maps singing Phonetic PosteriorGrams and prosody features from a singing template to time-domain singing samples, and a speaker i-vector extracted from target speech controls the speaker identity of the generated singing. At run-time, a singing template and target speech samples are used to generate the target singing vocals. Notably, neither the content nor the speaker identity of the target speech needs to match that of the singing template. Experimental results on the NUS-48E and NUS-HLT-SLS corpora suggest that the personalized SVG framework outperforms the traditional conversion-vocoder pipeline in both subjective and objective evaluations.
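As a rough illustration of the conditioning described in the abstract, the sketch below shows one plausible way to combine frame-level Phonetic PosteriorGrams (PPGs) and prosody features from a singing template with an utterance-level speaker i-vector from target speech. The abstract does not say how the features are combined, so the per-frame concatenation, the function name `build_conditioning`, the feature dimensions, and the toy values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def build_conditioning(
    ppg: Sequence[Sequence[float]],
    prosody: Sequence[Sequence[float]],
    ivector: Sequence[float],
) -> List[List[float]]:
    """Concatenate per-frame PPG and prosody features with a fixed i-vector.

    Hypothetical helper: the per-frame concatenation is an assumption,
    not a detail confirmed by the abstract.
    """
    if len(ppg) != len(prosody):
        raise ValueError("PPG and prosody must have the same number of frames")
    # The i-vector is utterance-level, so the same vector is appended to
    # every frame of the singing template.
    return [list(p) + list(q) + list(ivector) for p, q in zip(ppg, prosody)]


# Toy values: 3 frames, 4 phonetic classes, 2 prosody values, 3-dim i-vector.
ppg = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
prosody = [[220.0, 1.0], [233.1, 1.0], [0.0, 0.0]]  # assumed: F0 in Hz, voicing flag
ivector = [0.05, -0.12, 0.30]  # assumed to come from the target speaker's speech

cond = build_conditioning(ppg, prosody, ivector)
```

In this sketch, the PPG and prosody rows come from the singing template while the i-vector comes from separate target speech, which reflects the abstract's point that the two inputs need not share content or speaker identity. A neural vocoder such as WaveRNN would then consume a sequence like `cond` to generate waveform samples.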

Odyssey 2020

The Speaker and Language Recognition Workshop


