InterSpeech 2021

Fine-grained Prosody Modeling in Neural Speech Synthesis using ToBI Representation
(3-minute introduction)

Yuxiang Zou (ByteDance, China), Shichao Liu (ByteDance, China), Xiang Yin (ByteDance, China), Haopeng Lin (ByteDance, China), Chunfeng Wang (ByteDance, China), Haoyu Zhang (ByteDance, China), Zejun Ma (ByteDance, China)
Benefiting from the rapid development of deep learning, modern neural text-to-speech (TTS) models can generate speech nearly indistinguishable from natural speech. However, the generated utterances often exhibit the average prosodic style of the training database rather than rich prosodic variation. For pitch-stressed languages such as English, accurate intonation and stress are important for conveying semantic information. In this work, we propose a fine-grained prosody modeling method for neural speech synthesis using the ToBI (Tones and Break Indices) representation. The proposed system consists of a text frontend for ToBI prediction and a Tacotron-based TTS module for prosody modeling. By introducing the ToBI representation, we can control the system to synthesize speech with accurate intonation and stress at the syllable level. Compared with two baselines (Tacotron and an unsupervised method), experiments show that our model generates more natural speech with more accurate prosody, and can effectively control the stress, intonation, and pauses of the synthesized speech.
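
The abstract does not specify how the ToBI labels enter the acoustic model, so the following is only a minimal, hypothetical sketch of one common conditioning scheme: embedding per-phoneme ToBI tags (pitch accent, boundary tone, break index, broadcast from the syllable level) and concatenating them with phoneme embeddings before a Tacotron-style encoder. All class names, label inventories, and dimensions here are assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): condition a Tacotron-style
# encoder input on syllable-level ToBI labels via learned embeddings.
import torch
import torch.nn as nn

# Assumed ToBI sub-inventories: pitch accents, boundary tones, break indices.
PITCH_ACCENTS = ["NONE", "H*", "L*", "L+H*", "L*+H", "H+!H*"]
BOUNDARY_TONES = ["NONE", "L-", "H-", "L-L%", "L-H%", "H-H%", "H-L%"]
BREAK_INDICES = ["0", "1", "2", "3", "4"]


class ToBIConditionedEncoderInput(nn.Module):
    """Embed per-phoneme ToBI tags and concatenate them with phoneme embeddings."""

    def __init__(self, n_phonemes=80, phoneme_dim=256, tobi_dim=32):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, phoneme_dim)
        self.accent_emb = nn.Embedding(len(PITCH_ACCENTS), tobi_dim)
        self.boundary_emb = nn.Embedding(len(BOUNDARY_TONES), tobi_dim)
        self.break_emb = nn.Embedding(len(BREAK_INDICES), tobi_dim)
        # Project the concatenated features back to the encoder input size.
        self.proj = nn.Linear(phoneme_dim + 3 * tobi_dim, phoneme_dim)

    def forward(self, phonemes, accents, boundaries, breaks):
        # All inputs: LongTensors of shape [batch, seq_len].
        x = torch.cat(
            [
                self.phoneme_emb(phonemes),
                self.accent_emb(accents),
                self.boundary_emb(boundaries),
                self.break_emb(breaks),
            ],
            dim=-1,
        )
        # Output [batch, seq_len, phoneme_dim], fed to the Tacotron encoder.
        return self.proj(x)


if __name__ == "__main__":
    module = ToBIConditionedEncoderInput()
    B, T = 2, 12
    phonemes = torch.randint(0, 80, (B, T))
    accents = torch.randint(0, len(PITCH_ACCENTS), (B, T))
    boundaries = torch.randint(0, len(BOUNDARY_TONES), (B, T))
    breaks = torch.randint(0, len(BREAK_INDICES), (B, T))
    print(module(phonemes, accents, boundaries, breaks).shape)  # [2, 12, 256]
```

At synthesis time, prosody control would then amount to overriding the predicted ToBI tags (e.g. placing an H* accent or a break index of 4 on a chosen syllable) before they are embedded.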