Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis
|Detai Xin (University of Tokyo, Japan), Yuki Saito (University of Tokyo, Japan), Shinnosuke Takamichi (University of Tokyo, Japan), Tomoki Koriyama (University of Tokyo, Japan), Hiroshi Saruwatari (University of Tokyo, Japan)|
We present a cross-lingual speaker adaptation method based on domain adaptation and a speaker consistency loss for text-to-speech (TTS) synthesis. Existing monolingual speaker adaptation methods based on direct fine-tuning are not applicable to cross-lingual data. The proposed method first trains a language-independent speaker encoder via speaker verification with domain adaptation on multilingual data, including the source and target languages. It then trains a monolingual multi-speaker TTS model on the source language's data using the speaker embeddings generated by the speaker encoder. To adapt the source-language TTS model to new speakers, the proposed method uses a speaker consistency loss that maximizes the cosine similarity between speaker embeddings generated from natural speech and from the same speaker's synthesized speech. This makes it possible to fine-tune the source-language TTS model on speech data of the target language. We conduct experiments on multi-speaker English and Japanese datasets with 207 speakers in total. Results of comprehensive experiments demonstrate that the proposed method significantly improves speech naturalness compared to the baseline method.
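The speaker consistency loss described above can be sketched as a negative cosine-similarity term between the two speaker embeddings; minimizing it maximizes their similarity. This is a minimal illustrative sketch, assuming plain-vector embeddings and the common `1 - cos` formulation; the paper's exact loss and embedding pipeline may differ.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def speaker_consistency_loss(emb_natural, emb_synth):
    # Hypothetical helper: loss = 1 - cos_sim, so minimizing the loss
    # drives the synthesized-speech embedding toward the natural-speech
    # embedding of the same speaker.
    return 1.0 - cosine_similarity(emb_natural, emb_synth)
```

During fine-tuning on target-language data, this term would be added to the usual TTS reconstruction loss, with both embeddings produced by the frozen language-independent speaker encoder.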