ASRU 2011


Speech Synthesis as a Statistical Machine Learning Problem

Keiichi Tokuda (Nagoya Institute of Technology)

Speech synthesis is often regarded as a messy problem. This talk will discuss how we can formulate the problem of speech synthesis in a statistical machine learning framework. The basic problem of speech synthesis can be stated as follows:

We have a speech database, i.e., a set of speech waveforms and corresponding texts. Given a text to be synthesized, what is the speech waveform corresponding to the text?

The whole text-to-speech generation process can be decomposed into feasible subproblems, which can in turn be combined into a single statistical model for training. One of these subproblems is statistical parametric speech synthesis, which is called "HMM-based speech synthesis" when hidden Markov models (HMMs) are used as the statistical models. The talk will also discuss future challenges and directions in speech synthesis research.
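As a rough illustration of the problem statement above, the task can be written as a train-then-generate pair of maximizations. The symbols are assumptions introduced here and do not appear in the abstract: X and W stand for the waveforms and texts in the database, w for the text to be synthesized, x for the waveform to be generated, and \lambda for the parameters of the statistical model (for example, HMMs). This is a sketch of one common point-estimate reading, not necessarily the exact formulation presented in the talk.

```latex
% Assumed notation (not from the abstract):
%   X, W    : speech waveforms and corresponding texts in the database
%   w, x    : text to be synthesized and the waveform to be generated
%   \lambda : parameters of the statistical model (e.g. HMMs)
\begin{align}
  \text{Training:}  \quad \hat{\lambda} &= \arg\max_{\lambda} \; p(X \mid W, \lambda) \\
  \text{Synthesis:} \quad \hat{x}       &= \arg\max_{x} \; p(x \mid w, \hat{\lambda})
\end{align}
```

Under this reading, the two steps approximate drawing x from the predictive distribution p(x \mid w, X, W), with the model parameters replaced by a single estimate.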

Outline

0:00:30  Intro
0:03:07  Statistical Formulation of Speech Synthesis
0:07:44  Mathematical Formulation
0:08:13  HMM-based speech synthesis system
0:16:57  Contextual factors
0:20:13  Composition of sentence HMM for given text
0:22:59  Dynamic features
0:29:31  Examples Demonstrating Its Flexibility
0:43:34  Discussion and Conclusion
0:45:21  Summary