Xiaoxue Gao, Xiaohai Tian, Yi Zhou, Rohan Kumar Das, Haizhou Li
In this paper, we formulate a personalized singing voice generation (SVG) framework using WaveRNN with non-parallel training data. We develop an average singing voice generation model using WaveRNN trained on vocals from multiple singers. The model maps singing Phonetic PosteriorGrams and prosody features from a singing template to time-domain singing samples, while a speaker i-vector extracted from the target speech controls the speaker identity of the generated singing. At run-time, a singing template and target speech samples are used to generate the target singing vocals. Notably, the content and the speaker identity of the target speech need not be the same as those of the singing template. Experimental results on the NUS-48E and NUS-HLT-SLS corpora suggest that the personalized SVG framework outperforms the traditional conversion-vocoder pipeline in both subjective and objective evaluations.
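As a rough illustration of the conditioning described above, the sketch below assembles a per-frame input for a neural vocoder by joining Phonetic PosteriorGram and prosody features from the singing template with an utterance-level speaker i-vector from the target speech. The function name `build_conditioning`, the feature dimensions, and the choice of simple concatenation are all assumptions made for illustration; the abstract does not specify how the features are combined.

```python
import numpy as np

def build_conditioning(ppg, prosody, ivector):
    """Join frame-level features with a speaker i-vector repeated per frame.

    Hypothetical helper: concatenation is an assumed combination scheme,
    not one confirmed by the abstract.
    """
    ppg = np.asarray(ppg, dtype=np.float32)          # (frames, ppg_dim)
    prosody = np.asarray(prosody, dtype=np.float32)  # (frames, prosody_dim)
    ivector = np.asarray(ivector, dtype=np.float32)  # (ivector_dim,)
    if ppg.shape[0] != prosody.shape[0]:
        raise ValueError("PPG and prosody must have the same number of frames")
    num_frames = ppg.shape[0]
    # The i-vector is utterance-level, so repeat it for every frame.
    speaker = np.tile(ivector[None, :], (num_frames, 1))
    return np.concatenate([ppg, prosody, speaker], axis=1)

# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
ppg = rng.random((200, 42))      # content features from the singing template
prosody = rng.random((200, 2))   # e.g. pitch-related features per frame
ivector = rng.random(100)        # speaker identity from the target speech
cond = build_conditioning(ppg, prosody, ivector)
print(cond.shape)  # (200, 144)
```

Because the i-vector comes from the target speech while the frame-level features come from the singing template, swapping `ivector` changes the intended speaker identity without touching the content or prosody inputs.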