InterSpeech 2021

Fine-tuning pre-trained voice conversion model for adding new target speakers with limited data
(3-minute introduction)

Takeshi Koshizuka (Tokyo University of Science, Japan), Hidefumi Ohmura (Tokyo University of Science, Japan), Kouichi Katsurada (Tokyo University of Science, Japan)
Voice conversion (VC) is a technique that converts the speaker-dependent non-linguistic information of an utterance into that of another speaker, while retaining the linguistic information of the input speech. A typical VC system comprises two modules: an encoder module that removes speaker individuality from the input speech and a decoder module that incorporates another speaker's individuality into the synthesized speech. This paper proposes a training method for a vocoder-free any-to-many encoder-decoder VC model with limited data. Various pre-training techniques have been proposed to address the problem of limited training data; some of these techniques employ the text-to-speech (TTS) task for pre-training. We instead pre-train the decoder module on the voice conversion task itself, so that the pre-training technique can be extended to continuously adding new target speakers to the VC system. The experimental results show that good conversion performance can be achieved with VC-based pre-training. We also confirm that the rehearsal and pseudo-rehearsal methods can effectively fine-tune the model without degrading the conversion performance for the pre-trained target speakers.
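The abstract describes an encoder-decoder VC model that is fine-tuned for a new target speaker while rehearsal data from the pre-trained target speakers is mixed into the updates. The following is a minimal sketch of that idea, not the authors' implementation: the GRU-based modules, feature dimensions, L1 loss, speaker-embedding table, and toy data below are all illustrative assumptions.

```python
# Sketch of an any-to-many encoder-decoder VC model fine-tuned on a new target
# speaker with rehearsal (or pseudo-rehearsal) batches mixed in. All module
# choices and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps input acoustic features to a speaker-independent representation."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                          # x: (batch, frames, feat_dim)
        h, _ = self.net(x)
        return h                                   # (batch, frames, 2 * hidden)

class Decoder(nn.Module):
    """Re-introduces target-speaker individuality via a speaker embedding."""
    def __init__(self, hidden=512, feat_dim=80, num_speakers=4, spk_dim=64):
        super().__init__()
        self.spk_emb = nn.Embedding(num_speakers, spk_dim)
        self.net = nn.GRU(hidden + spk_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, h, spk_id):                  # spk_id: (batch,)
        e = self.spk_emb(spk_id).unsqueeze(1).expand(-1, h.size(1), -1)
        out, _ = self.net(torch.cat([h, e], dim=-1))
        return self.proj(out)                      # converted acoustic features

def add_new_speaker(decoder):
    """Grow the speaker-embedding table by one row so a new target speaker can
    be added to the pre-trained any-to-many model; returns the new speaker id."""
    old = decoder.spk_emb
    grown = nn.Embedding(old.num_embeddings + 1, old.embedding_dim)
    with torch.no_grad():
        grown.weight[: old.num_embeddings] = old.weight
    decoder.spk_emb = grown
    return old.num_embeddings

def finetune_step(encoder, decoder, optimizer, batches):
    """One fine-tuning step: `batches` mixes the limited new-speaker data with
    rehearsal data (real utterances of the pre-trained target speakers, or
    pseudo-samples generated by the pre-trained model) so that the conversion
    quality for earlier speakers is not forgotten."""
    loss_fn = nn.L1Loss()
    optimizer.zero_grad()
    total = torch.zeros(())
    for src, tgt, spk in batches:                  # parallel (source, target) features
        pred = decoder(encoder(src), spk)
        total = total + loss_fn(pred, tgt)
    total.backward()
    optimizer.step()
    return total.item()

# Toy usage: add a 5th target speaker and run one mixed fine-tuning step.
enc, dec = Encoder(), Decoder(num_speakers=4)
new_id = add_new_speaker(dec)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
src, tgt = torch.randn(2, 100, 80), torch.randn(2, 100, 80)
new_batch = (src, tgt, torch.full((2,), new_id))   # limited new-speaker data
rehearsal = (src, tgt, torch.tensor([0, 1]))       # pre-trained target speakers
finetune_step(enc, dec, opt, [new_batch, rehearsal])
```

In this sketch, pseudo-rehearsal would replace the rehearsal batch's target features with outputs generated by the frozen pre-trained model, so no stored utterances of the earlier target speakers are required.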