TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions (3 minutes introduction)

Cheng Gong (Tianjin University, China), Longbiao Wang (Tianjin University, China), Ju Zhang (Huiyan Technology, China), Shaotong Guo (Tianjin University, China), Yuguang Wang (Huiyan Technology, China), Jianwu Dang (Tianjin University, China)

Combining the recently proposed LPCNet vocoder with a seq-to-seq acoustic model, i.e., Tacotron, has produced lightweight speech synthesis systems. However, the quality of the synthesized speech is often unstable because acoustic models do not predict pitch parameters precisely enough, especially for tonal languages such as Chinese and Japanese. In this paper, we propose TacoLPCNet, an end-to-end speech synthesis system that conditions LPCNet on Mel spectrogram predictions. First, we extend LPCNet to take the Mel spectrogram as input instead of explicit pitch information and a pitch-related network. Furthermore, we optimize the system through model pruning, multi-frame inference, and an increased frame length so that it meets the requirements of real-time applications. Objective and subjective evaluations on various languages show that the proposed system is more stable for tonal languages under the proposed optimization strategies. The experiments also verify that our model synthesizes speech 3.12 times faster than the baseline on a standard CPU while maintaining naturalness.

Related Recordings

Phonetic and Prosodic Information Estimation Using Neural Machine Translation for Genuine Japanese End-to-End Text-to-Speech
(3 minutes introduction)

Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima

Phonetic and Prosodic Information Estimation Using Neural Machine Translation for Genuine Japanese End-to-End Text-to-Speech
(longer introduction)

Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima

InterSpeech 2021
