Cross-lingual Speaker Adaptation using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis (longer introduction)

Detai Xin (University of Tokyo, Japan), Yuki Saito (University of Tokyo, Japan), Shinnosuke Takamichi (University of Tokyo, Japan), Tomoki Koriyama (University of Tokyo, Japan), Hiroshi Saruwatari (University of Tokyo, Japan)

We present a cross-lingual speaker adaptation method based on domain adaptation and a speaker consistency loss for text-to-speech (TTS) synthesis. Existing monolingual speaker adaptation methods based on direct fine-tuning are not applicable to cross-lingual data. The proposed method first trains a language-independent speaker encoder on a speaker verification task, using domain adaptation on multilingual data that includes both the source and the target languages. It then trains a monolingual multi-speaker TTS model on the source-language data using the speaker embeddings generated by the speaker encoder. To adapt the source-language TTS model to new speakers, the proposed method uses a speaker consistency loss that maximizes the cosine similarity between speaker embeddings generated from natural speech and from the same speaker’s synthesized speech. This makes it possible to fine-tune the source-language TTS model on target-language speech data. We conduct experiments on multi-speaker English and Japanese datasets with 207 speakers in total. Results of comprehensive experiments demonstrate that the proposed method significantly improves speech naturalness compared to the baseline method.
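
The speaker consistency loss described in the abstract can be illustrated with a small sketch. The abstract only states that the loss maximizes cosine similarity between embeddings of natural and synthesized speech, so the exact form used here (one minus the mean cosine similarity over a batch) is an assumption, and the function names `cosine_similarity` and `speaker_consistency_loss` are hypothetical. Embeddings are plain lists of floats standing in for speaker encoder outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speaker_consistency_loss(
    natural_embs: Sequence[Sequence[float]],
    synth_embs: Sequence[Sequence[float]],
) -> float:
    """Assumed form: 1 - mean cosine similarity over paired embeddings.

    natural_embs[i] is the speaker embedding of a natural utterance and
    synth_embs[i] is the embedding of synthesized speech for the same speaker.
    Minimizing this value maximizes the cosine similarity.
    """
    if len(natural_embs) != len(synth_embs) or len(natural_embs) == 0:
        raise ValueError("need a non-empty batch of paired embeddings")
    sims = [cosine_similarity(n, s) for n, s in zip(natural_embs, synth_embs)]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    natural = [[1.0, 0.0], [0.0, 1.0]]
    synthesized = [[2.0, 0.0], [1.0, 0.0]]
    # First pair points the same way, second pair is orthogonal.
    print(speaker_consistency_loss(natural, synthesized))
```

In an actual training setup this quantity would be computed on differentiable tensors so that its gradient can flow back into the TTS model; the sketch above only shows the arithmetic.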


InterSpeech 2021
