InterSpeech 2021

TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions

Cheng Gong (Tianjin University, China), Longbiao Wang (Tianjin University, China), Ju Zhang (Huiyan Technology, China), Shaotong Guo (Tianjin University, China), Yuguang Wang (Huiyan Technology, China), Jianwu Dang (Tianjin University, China)
The combination of the recently proposed LPCNet vocoder and a seq-to-seq acoustic model, i.e., Tacotron, has successfully achieved lightweight speech synthesis systems. However, the quality of synthesized speech is often unstable because the precision of the pitch parameters predicted by acoustic models is insufficient, especially for some tonal languages like Chinese and Japanese. In this paper, we propose an end-to-end speech synthesis system, TacoLPCNet, by conditioning LPCNet on Mel spectrogram predictions. First, we extend LPCNet for the Mel spectrogram instead of using explicit pitch information and pitch-related network. Furthermore, we optimize the system by model pruning, multi-frame inference, and increasing frame length, to enable it to meet the conditions required for real-time applications. The objective and subjective evaluation results for various languages show that the proposed system is more stable for tonal languages within the proposed optimization strategies. The experimental results also verify that our model improves synthesis runtime by 3.12 times than that of the baseline on a standard CPU while maintaining naturalness.