Normalization Driven Zero-shot Multi-Speaker Speech Synthesis
(3 minutes introduction)

Neeraj Kumar (Hike, India), Srishti Goel (Hike, India), Ankur Narang (Hike, India), Brejesh Lall (IIT Delhi, India)

In this paper, we present a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) that combines a normalization architecture and a speaker encoder with a non-autoregressive, multi-head attention driven encoder-decoder architecture. Given an input text and a reference speech sample from an unseen person, ZSM-SS can generate speech in that person’s style in a zero-shot manner. We also show that the affine parameters of the normalization layers capture prosodic features such as energy and fundamental frequency in a disentangled fashion, and that these parameters can be used to generate morphed speech output. We demonstrate the efficacy of the proposed architecture on the multi-speaker VCTK [1] and LibriTTS [2] datasets, using multiple quantitative metrics that measure generated speech distortion and mean opinion score (MOS), along with a speaker embedding analysis of the proposed speaker encoder model.
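To make the role of the affine parameters concrete, here is a minimal sketch of a speaker-conditioned normalization step. It is not the ZSM-SS implementation: the function name `conditional_layer_norm`, the projection matrices `w_scale` and `w_shift`, and the choice of layer normalization are all assumptions made for illustration. The only idea taken from the abstract is that the scale and shift applied after normalization depend on the reference speaker.

```python
import math
import random


def conditional_layer_norm(frame, speaker_emb, w_scale, w_shift, eps=1e-5):
    """Normalize one feature frame, then apply a speaker-dependent affine map.

    frame:       list of C hidden features for one time step
    speaker_emb: list of D values from a (hypothetical) speaker encoder
    w_scale:     D x C projection (hypothetical) giving a per-channel scale offset
    w_shift:     D x C projection (hypothetical) giving a per-channel shift
    """
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    normed = [(v - mean) / math.sqrt(var + eps) for v in frame]

    def project(weights, c):
        # Dot product of the speaker embedding with column c of `weights`.
        return sum(e * row[c] for e, row in zip(speaker_emb, weights))

    # Scale is 1 + projection, so a zero embedding leaves the normalized frame unchanged.
    return [
        (1.0 + project(w_scale, c)) * normed[c] + project(w_shift, c)
        for c in range(len(frame))
    ]


random.seed(0)
C, D = 8, 4  # illustrative sizes only
frame = [random.gauss(0, 1) for _ in range(C)]
speaker_a = [random.gauss(0, 1) for _ in range(D)]
w_scale = [[0.1 * random.gauss(0, 1) for _ in range(C)] for _ in range(D)]
w_shift = [[0.1 * random.gauss(0, 1) for _ in range(C)] for _ in range(D)]

styled = conditional_layer_norm(frame, speaker_a, w_scale, w_shift)
neutral = conditional_layer_norm(frame, [0.0] * D, w_scale, w_shift)
```

In this sketch the same content frame yields different outputs for different embeddings. Interpolating between two speaker embeddings before the call would be one simple way to experiment with morphed output, although the abstract does not say how the authors perform morphing.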

StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition
(3 minutes introduction)

Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka

InterSpeech 2021

Related Recordings

Fine-tuning pre-trained voice conversion model for adding new target speakers with limited data
(3 minutes introduction)

StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition
(3 minutes introduction)
