The 9th International Symposium on Chinese Spoken Language Processing

Tutorial 4: Deep Learning for Speech Generation and Synthesis

Yao Qian and Frank K. Soong

Deep learning, which can represent high-level abstractions in data with an architecture of multiple non-linear transformations, has made a huge impact on automatic speech recognition (ASR) research, products and services. However, deep learning for speech generation and synthesis (i.e., text-to-speech), the inverse process of speech recognition (i.e., speech-to-text), has not yet generated similar momentum. Recently, motivated by the success of Deep Neural Networks in speech recognition, several neural network based approaches have been applied successfully to improve the performance of statistical parametric speech generation and synthesis. In this tutorial, we focus on deep learning approaches to the problems in speech generation and synthesis, especially Text-to-Speech (TTS) synthesis and voice conversion.

First, we review the current mainstream of statistical parametric speech generation and synthesis, i.e., GMM-HMM based speech synthesis and GMM-based voice conversion. We emphasize the major factors responsible for the quality problems in GMM-based synthesis and conversion, and the intrinsic limitations of decision-tree based contextual state clustering and state-based statistical distribution modeling. We then present the latest deep learning algorithms for feature parameter trajectory generation, in contrast to deep learning for recognition or classification. We cover common technologies in Deep Neural Networks (DNN) and improved DNNs: Mixture Density Networks (MDN), Recurrent Neural Networks (RNN) with Bidirectional Long Short-Term Memory (BLSTM), and Conditional RBM (CRBM). Finally, we share our research insights and hands-on experience in building speech generation and synthesis systems based upon deep learning algorithms.