Context-Dependent Deep Neural Networks for Large Vocabulary Speech Recognition: From Discovery to Practical Systems
Frank Seide (Microsoft Research Asia)
In 2010, it was shown that combining hybrid ANN-HMMs with both traditional senone modeling and deep learning yields a powerful new acoustic model for ASR. Dubbed the Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, it has so far led to over 40% relative error reduction for speaker-independent recognition on the Switchboard benchmark, compared to the conventional GMM baseline. This is arguably the largest gain obtained through a single technology in ASR. This talk will describe how this discovery has been further developed towards use in practical systems. We will focus specifically on the remarkable benefits of the DNN's ability to learn better feature representations and how these can help in real-life applications, as well as on the no less remarkable difficulties arising from the computational cost in training and at runtime, and on approaches to address them.
Frank Seide, a native of Hamburg, Germany, is a Senior Researcher/Research Manager at Microsoft Research. His current research focus is on deep neural networks for conversational speech recognition; together with co-author Dong Yu, he was the first to show the effectiveness of CD-DNN-HMMs for recognition of conversational speech. Since graduating in 1993, Frank has worked on various speech topics, including spoken-dialogue systems, Mandarin speech recognition, audio search, and speech-to-speech translation, first at Philips Research in Aachen and Taipei and now at Microsoft Research Asia (Beijing).