InterSpeech 2021

Fine-grained Prosody Modeling in Neural Speech Synthesis using ToBI Representation
(3-minute introduction)

Yuxiang Zou (ByteDance, China), Shichao Liu (ByteDance, China), Xiang Yin (ByteDance, China), Haopeng Lin (ByteDance, China), Chunfeng Wang (ByteDance, China), Haoyu Zhang (ByteDance, China), Zejun Ma (ByteDance, China)
Benefiting from the rapid development of deep learning, modern neural text-to-speech (TTS) models can generate speech nearly indistinguishable from natural speech. However, the generated utterances often exhibit the average prosodic style of the training database rather than rich prosodic variation. For pitch-stressed languages such as English, accurate intonation and stress are important for conveying semantic information. In this work, we propose a fine-grained prosody modeling method for neural speech synthesis using the ToBI (Tones and Break Indices) representation. The proposed system consists of a text frontend for ToBI prediction and a Tacotron-based TTS module for prosody modeling. By introducing the ToBI representation, we can control the system to synthesize speech with accurate intonation and stress at the syllable level. Compared with two baselines (Tacotron and an unsupervised method), experiments show that our model generates more natural speech with more accurate prosody, and can effectively control the stress, intonation, and pauses of the synthesized speech.
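
The abstract does not specify how the ToBI labels enter the acoustic model, so the following is only a minimal, hypothetical sketch of one common conditioning scheme: embedding per-phoneme ToBI tags (pitch accent, boundary tone, break index, broadcast from the syllable level) and concatenating them with phoneme embeddings before a Tacotron-style encoder. All class names, label inventories, and dimensions here are assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): condition a Tacotron-style
# encoder input on syllable-level ToBI labels via learned embeddings.
import torch
import torch.nn as nn

# Assumed ToBI sub-inventories: pitch accents, boundary tones, break indices.
PITCH_ACCENTS = ["NONE", "H*", "L*", "L+H*", "L*+H", "H+!H*"]
BOUNDARY_TONES = ["NONE", "L-", "H-", "L-L%", "L-H%", "H-H%", "H-L%"]
BREAK_INDICES = ["0", "1", "2", "3", "4"]


class ToBIConditionedEncoderInput(nn.Module):
    """Embed per-phoneme ToBI tags and concatenate them with phoneme embeddings."""

    def __init__(self, n_phonemes=80, phoneme_dim=256, tobi_dim=32):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, phoneme_dim)
        self.accent_emb = nn.Embedding(len(PITCH_ACCENTS), tobi_dim)
        self.boundary_emb = nn.Embedding(len(BOUNDARY_TONES), tobi_dim)
        self.break_emb = nn.Embedding(len(BREAK_INDICES), tobi_dim)
        # Project the concatenated features back to the encoder input size.
        self.proj = nn.Linear(phoneme_dim + 3 * tobi_dim, phoneme_dim)

    def forward(self, phonemes, accents, boundaries, breaks):
        # All inputs: LongTensors of shape [batch, seq_len].
        x = torch.cat(
            [
                self.phoneme_emb(phonemes),
                self.accent_emb(accents),
                self.boundary_emb(boundaries),
                self.break_emb(breaks),
            ],
            dim=-1,
        )
        # Output [batch, seq_len, phoneme_dim], fed to the Tacotron encoder.
        return self.proj(x)


if __name__ == "__main__":
    module = ToBIConditionedEncoderInput()
    B, T = 2, 12
    phonemes = torch.randint(0, 80, (B, T))
    accents = torch.randint(0, len(PITCH_ACCENTS), (B, T))
    boundaries = torch.randint(0, len(BOUNDARY_TONES), (B, T))
    breaks = torch.randint(0, len(BREAK_INDICES), (B, T))
    print(module(phonemes, accents, boundaries, breaks).shape)  # [2, 12, 256]
```

At synthesis time, prosody control would then amount to overriding the predicted ToBI tags (e.g. placing an H* accent or a break index of 4 on a chosen syllable) before they are embedded.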