Interspeech 2021

Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input
(Oral presentation)

Brooke Stephenson (GIPSA-lab (UMR 5216), France), Thomas Hueber (GIPSA-lab (UMR 5216), France), Laurent Girin (GIPSA-lab (UMR 5216), France), Laurent Besacier (LIG (UMR 5217), France)
Inferring the prosody of a word in text-to-speech synthesis requires information about its surrounding context. In incremental text-to-speech synthesis, where the synthesizer produces an output before it has access to the complete input, the full context is often unknown which can result in a loss of naturalness. In this paper, we investigate whether the use of predicted future text from a transformer language model can attenuate this loss in a neural TTS system. We compare several test conditions of next future word: (a) unknown (zero-word), (b) language model predicted, (c) randomly predicted and (d) ground-truth. We measure the prosodic features (pitch, energy and duration) and find that predicted text provides significant improvements over a zero-word lookahead, but only slight gains over random-word lookahead. We confirm these results with a perceptive test.