ASRU 2011

Speech Synthesis as a Statistical Machine Learning Problem

Keiichi Tokuda (Nagoya Institute of Technology)

Speech synthesis is often regarded as a messy problem. This talk will discuss how we can formulate the problem of speech synthesis in a statistical machine learning framework. The basic problem of speech synthesis can be stated as follows:

We have a speech database, i.e., a set of speech waveforms and corresponding texts. Given a text to be synthesized, what is the speech waveform corresponding to the text?
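One way to write this question down, given here only as a sketch and not as the formulation used in the talk, is as a predictive distribution. The notation is assumed for illustration: X and W are the waveforms and texts in the database, w is the text to be synthesized, x is the waveform to generate, and \lambda stands for the parameters of whatever statistical model is chosen. The sketch also assumes that, once \lambda is known, x depends on the database only through \lambda.

```latex
% Predictive distribution of the waveform x for the input text w,
% given the database (X, W); the model parameters \lambda are integrated out.
\[
  p(x \mid w, X, W) \;=\; \int p(x \mid w, \lambda)\, p(\lambda \mid X, W)\, d\lambda
\]

% A common plug-in approximation: replace the integral by a point estimate.
\[
  \hat{\lambda} \;=\; \arg\max_{\lambda}\; p(\lambda \mid X, W)
  \qquad \text{(training)}
\]
\[
  \hat{x} \;\sim\; p(x \mid w, \hat{\lambda})
  \qquad \text{(synthesis)}
\]
```

Under this reading, training and synthesis are two steps of one inference problem: the first fits the model to the database, and the second draws (or picks the most probable) waveform for the new text.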

The whole text-to-speech generation process can be decomposed into feasible subproblems, which can also be combined into a statistical model for training. One of the subproblems is statistical parametric speech synthesis, which is called "HMM-based speech synthesis" when hidden Markov models (HMMs) are used as the statistical models. The talk will also discuss future challenges and directions in speech synthesis research.