Cross-lingual Low Resource Speaker Adaptation Using Phonological Features
(3 minutes introduction)

Georgia Maniati (Samsung, Greece), Nikolaos Ellinas (Samsung, Greece), Konstantinos Markopoulos (Samsung, Greece), Georgios Vamvoukakis (Samsung, Greece), June Sig Sung (Samsung, Korea), Hyoungmin Park (Samsung, Korea), Aimilios Chalamandaris (Samsung, Greece), Pirros Tsiakoulis (Samsung, Greece)

The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has recently been proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages, with the goal of achieving cross-lingual speaker adaptation. We first experiment with the effect of language phonological similarity on cross-lingual TTS of several source-target language combinations. Subsequently, we fine-tune the model with very limited data of a new speaker’s voice in either a seen or an unseen language, and achieve synthetic speech of equal quality, while preserving the target speaker’s identity. With as few as 32 or even 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature. In the extreme case of only 2 available adaptation utterances, we find that our model behaves as a few-shot learner, as the performance is similar in both the seen and unseen adaptation language scenarios.

Related Recordings

Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information
(3 minutes introduction)

Haoyue Zhan, Haitong Zhang, Wenjie Ou, Yue Lin

EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder
(3 minutes introduction)

Zhengchen Liu, Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

InterSpeech 2021
