PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS (3 minutes introduction)

Ye Jia (Google, USA), Heiga Zen (Google, Japan), Jonathan Shen (Google, USA), Yu Zhang (Google, USA), Yonghui Wu (Google, USA)

This paper introduces PnG BERT, a new encoder model for neural TTS. The model augments the original BERT by taking both phoneme and grapheme representations of text as input, together with the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner and then fine-tuned on a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. Subjective side-by-side preference evaluations show that raters have no statistically significant preference between speech synthesized using PnG BERT and ground-truth recordings from professional speakers.
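As a rough illustration of what "phoneme and grapheme input plus word-level alignment" could look like in practice, the sketch below builds a single BERT-style sequence from per-word phoneme and grapheme tokens. The function name `build_png_input`, the `[CLS]`/`[SEP]` layout, the use of word ID 0 for special tokens, and the example phoneme symbols are all assumptions made for illustration; the abstract does not specify these details.

```python
from typing import Dict, List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_png_input(words: List[Tuple[List[str], List[str]]]) -> Dict[str, list]:
    """Build a joint phoneme + grapheme sequence (hypothetical layout).

    `words` holds one (phoneme_tokens, grapheme_tokens) pair per word.
    Returns parallel lists:
      tokens      - [CLS] phonemes [SEP] graphemes [SEP]
      segment_ids - 0 for the phoneme segment, 1 for the grapheme segment
      word_ids    - 1-based word index shared by both views of a word,
                    0 for special tokens (an assumption of this sketch)
    """
    tokens, segment_ids, word_ids = [CLS], [0], [0]

    # Segment 0: phonemes of every word, tagged with the word index.
    for idx, (phonemes, _) in enumerate(words, start=1):
        tokens.extend(phonemes)
        segment_ids.extend([0] * len(phonemes))
        word_ids.extend([idx] * len(phonemes))
    tokens.append(SEP)
    segment_ids.append(0)
    word_ids.append(0)

    # Segment 1: graphemes of every word, tagged with the same word index,
    # which is what encodes the word-level alignment between the two views.
    for idx, (_, graphemes) in enumerate(words, start=1):
        tokens.extend(graphemes)
        segment_ids.extend([1] * len(graphemes))
        word_ids.extend([idx] * len(graphemes))
    tokens.append(SEP)
    segment_ids.append(1)
    word_ids.append(0)

    return {"tokens": tokens, "segment_ids": segment_ids, "word_ids": word_ids}


# Illustrative input: phoneme symbols and whole-word graphemes are made up.
example = build_png_input([
    (["HH", "AH", "L", "OW"], ["hello"]),
    (["W", "ER", "L", "D"], ["world"]),
])
```

In this sketch a phoneme token and a grapheme token that belong to the same word share a word ID, so an encoder with a word-position embedding could relate the two views of each word.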

Related Recordings

Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis
(longer introduction)

Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang

Speed up training with variable length inputs by efficient batching strategies
(3 minutes introduction)

Zhenhao Ge, Lakshmish Kaushik, Masanori Omote, Saket Kumar

InterSpeech 2021
