Interspeech 2021

Synthesis of expressive speaking styles with limited training data in a multi-speaker, prosody-controllable sequence-to-sequence architecture
(3-minute introduction)

Slava Shechtman (IBM, Israel), Raul Fernandez (IBM, USA), Alexander Sorin (IBM, Israel), David Haws (IBM, USA)

Although Sequence-to-Sequence (S2S) architectures have become state-of-the-art in speech synthesis, the best models benefit from access to moderate-to-large amounts of training data, posing a resource bottleneck when we are interested in generating speech in a variety of expressive styles. In this work we explore an S2S architecture variant that is capable of generating the variety of stylistic expressive variations observed in a limited amount of training data, and of transplanting that style to a neutral target speaker for whom no labeled expressive resources exist. The architecture is furthermore controllable, allowing the user to select an operating point that conveys a desired level of expressiveness. We evaluate this proposal against a classically supervised baseline via perceptual listening tests, and demonstrate that i) it outperforms the baseline in generalizing to neutral speakers, ii) it is strongly preferred for its ability to convey expressiveness, and iii) it offers a reasonable trade-off between expressiveness and naturalness, allowing the user to tune it to the demands of a given application.
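One common way to realize the kind of expressiveness control described above is to scale a learned style embedding by a user-chosen weight before conditioning the synthesis model on it. The sketch below illustrates that general idea only; the function and variable names (condition_on_style, style_vector, alpha) are illustrative assumptions, not the paper's actual interface, and the paper's conditioning mechanism may differ.

```python
def condition_on_style(encoder_states, style_vector, alpha):
    """Add a style embedding, scaled by an expressiveness weight, to
    every encoder frame.

    alpha = 0.0 corresponds to neutral speech; larger values push the
    output toward the expressive style. (Illustrative sketch only.)
    """
    # Scale the style embedding by the user-selected operating point.
    scaled = [alpha * s for s in style_vector]
    # Broadcast-add the scaled style vector onto each encoder frame.
    return [[h + s for h, s in zip(frame, scaled)]
            for frame in encoder_states]

# Toy example: 4 encoder frames with 3-dimensional embeddings.
enc = [[0.0, 0.0, 0.0] for _ in range(4)]
style = [1.0, -0.5, 2.0]
half_strength = condition_on_style(enc, style, alpha=0.5)
neutral = condition_on_style(enc, style, alpha=0.0)
```

Varying alpha at inference time then moves the output along an expressiveness-naturalness trade-off curve, which is the kind of tunable operating point the abstract refers to.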