Generative Adversarial Networks for Singing Voice Conversion with and without Parallel Data

Berrak Sisman, Haizhou Li

Singing voice conversion (SVC) is the task of converting one singer's voice to sound like that of another without changing the lyrical content. Singing conveys both lexical and emotional information through words and tones, which need to be transferred from the source to the target. In this paper, we propose novel solutions to SVC based on Generative Adversarial Networks (GANs), with and without parallel training data. With parallel data, we employ GANs to minimize the difference between the distributions of the original target parameters and the generated singing parameters. With non-parallel training data, we employ CycleGANs to estimate an optimal pseudo pair between source and target singers. Moreover, the proposed solutions perform well with a limited amount of training data. The experiments show that (1) GANs outperform other state-of-the-art voice conversion methods when parallel training data are available, and (2) CycleGANs achieve competitive voice conversion quality without the need for parallel training data.
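The non-parallel setting mentioned in the abstract relies on the general CycleGAN idea that a round trip through both mappings (source to target and back) should return the original features. The following is a minimal toy sketch of that cycle-consistency term only, not the authors' implementation: the names `g_xy`, `g_yx`, `l1_distance`, and `cycle_consistency_loss` are hypothetical, the "generators" are simple element-wise scalings standing in for neural networks, and the adversarial terms are omitted.

```python
# Toy sketch of the CycleGAN cycle-consistency idea on feature vectors.
# All names here are illustrative assumptions, not taken from the paper.

def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(source_frames, target_frames, g_xy, g_yx):
    """Average L1 error after a round trip through both generators."""
    # source -> target -> source should reproduce the source frame
    forward = [l1_distance(g_yx(g_xy(x)), x) for x in source_frames]
    # target -> source -> target should reproduce the target frame
    backward = [l1_distance(g_xy(g_yx(y)), y) for y in target_frames]
    return sum(forward) / len(forward) + sum(backward) / len(backward)


# Hypothetical stand-ins for the two generators (real ones would be networks).
def g_xy(frame):
    return [2.0 * v for v in frame]


def g_yx(frame):
    return [0.5 * v for v in frame]


# Non-parallel data: the two sets need not be aligned or even the same size.
source = [[1.0, 2.0], [3.0, 4.0]]
target = [[2.0, 4.0], [6.0, 8.0], [1.0, 1.0]]

loss = cycle_consistency_loss(source, target, g_xy, g_yx)
print(loss)
```

Because the two toy mappings are exact inverses of each other, the round trip reproduces every frame and the loss is zero; any mismatch between the mappings would make it positive. In a full CycleGAN this term is combined with adversarial losses for each direction.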

Odyssey 2020

The Speaker and Language Recognition Workshop

Related Recordings

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss