ASRU 2013

Building Speech Recognition Systems with Low Resources

Tanja Schultz (KIT)

The performance of speech and language processing technologies has improved dramatically over the past decade, with an increasing number of systems being deployed in a large variety of applications, such as spoken dialog systems, speech translation, and information retrieval systems. Most efforts to date have focused on a very small number of languages spoken by a large number of speakers in countries of great economic potential, with populations that have immediate information technology needs. However, speech technology has a lot to contribute to those languages that do not fall into this category. Firstly, languages with a small number of speakers and few linguistic resources may suddenly become of interest for humanitarian or economic reasons, or because of regional conflicts. Secondly, a large number of languages are in danger of becoming extinct, and ongoing projects for preserving them could benefit from speech and language technologies.

With close to 7000 languages in the world and the need to support multiple input and output languages, the most important challenge today is to port speech processing systems to new languages rapidly and at reasonable cost. Major bottlenecks are the sparseness of speech and text data, the lack of language conventions, and the gap between technology and language expertise. Large-scale data resources are currently available for only a minority of languages, and the costs of data collection are prohibitive for all but the most widely spoken and economically viable languages. Thus data sparseness is one of the most pressing challenges, and it is further exacerbated by the fact that today’s speech technologies rely heavily on statistically based modeling schemes, such as Hidden Markov Models and N-gram language models. While statistical modeling algorithms are mostly language independent and have proved to work well for a variety of languages, the building of well-performing speech recognition systems is far from being language independent. The building process must take into account language peculiarities, such as sound systems, phonotactics, word segmentation, and morphology. The lack of language conventions concerns a surprisingly large number of languages and dialects. The lack of a standardized writing system, for example, concerns the majority of languages and hinders web harvesting of large text corpora as well as the construction of vocabularies and dictionaries. Last but not least, despite the well-defined process of system building, handling language-specific peculiarities is not only costly and time-consuming but also requires substantial language expertise. Unfortunately, it is often difficult to find system developers who simultaneously have the required technical background and significant insight into the language in question. Consequently, one of the central issues in developing speech processing systems in many languages is the challenge of bridging the gap between language and technology expertise.

In my talk I will present ongoing work at the Cognitive Systems Lab on rapidly building speech recognition systems for as yet unsupported languages when little speech data and few or no transcripts, text, or linguistic resources are available. This includes the sharing of data and models across languages, as well as the rapid adaptation of language-independent models to as yet unsupported languages. I will describe techniques and tools that lower the overall costs of system development by automating the system building process, leveraging crowdsourcing, and reducing the data needs without significant performance losses. Finally, I will present the web-based Rapid Language Adaptation Toolkit (RLAT), an online service that enables native language experts to build speech recognition components without requiring detailed technology expertise. RLAT enables the user to collect speech data or leverage multilingual seed models to initialize and train acoustic models, to harvest large amounts of text data in order to create language models, and to automatically derive vocabularies and generate pronunciation models. The resulting components can be evaluated in an end-to-end system, allowing for iterative improvements. By archiving the data gathered on the fly from many cooperative users, we hope to significantly increase the repository of languages and linguistic resources, and to make the data and components available to the community at large. By keeping the users in the development loop, RLAT can learn from the users’ expertise to constantly adapt and improve. This will hopefully revolutionize the system development process for as yet under-supported languages.


Tanja Schultz received her Ph.D. and Master's degrees in Computer Science from the University of Karlsruhe, Germany, in 2000 and 1995 respectively, and passed the German state examination for teachers of Mathematics, Sports, and Educational Science at Heidelberg University in 1990. She joined Carnegie Mellon University in 2000 and became a Research Professor at the Language Technologies Institute. Since 2007 she has been a Full Professor at the Department of Informatics of the Karlsruhe Institute of Technology (KIT) in Germany. She is the director of the Cognitive Systems Lab, where her research activities focus on human-machine interfaces, with a particular area of expertise in rapid adaptation of speech processing systems to new domains and languages. She co-edited a book on this subject and received several awards for this work. In 2001 she received the FZI prize for an outstanding Ph.D. thesis. In 2002 she was awarded the Allen Newell Medal for Research Excellence from Carnegie Mellon for her contribution to Speech Translation, and the ISCA best paper award for her publication on language independent acoustic modeling. In 2005 she received the Carnegie Mellon Language Technologies Institute Junior Faculty Chair. Her recent research work on silent speech interfaces based on myoelectric signals received best demo and paper prizes and was awarded the Alcatel-Lucent Research Award for Technical Communication in 2012. Tanja Schultz is the author of more than 250 articles published in books, journals, and proceedings. She has been a member of the Society of Computer Science (GI) for more than 20 years, and is a member of the IEEE Computer Society and the International Speech Communication Association (ISCA), where she is serving her second term as an elected ISCA board member.