InterSpeech 2020

Doing Something We Never Could with Spoken Language Technologies - from Early Days to the Era of Deep Learning

Lin-shan Lee
National Taiwan University
Abstract

Some research tries to do something better, while other research tries to do something we never could. Good examples of the former include making aircraft fly faster and making images look more beautiful; good examples of the latter include developing the Internet to connect everyone in the world and using Google to select information from everything on the Internet, to name a few. The former is always very good, while the latter is usually challenging. This talk is about the latter. A major problem with the latter is that what we could never do before is very often very far from realization. This is actually normal for most research work, which users can enjoy only after industry realizes it when the right time arrives. The only difference is that here we may need to wait longer until the right time comes and the right industry appears. Also, the right industry that eventually appears at the right time may use new generations of technologies very different from the earlier solutions found in research. In this talk I'll present my personal experiences of doing something we never could with spoken language technologies, from the early days to the era of deep learning, including what I considered, what I did and found, and what lessons we can learn today, ranging over various areas of spoken language technologies.

Lin-shan Lee has been teaching in Electrical Engineering and Computer Science at National Taiwan University since 1979. He invented, published and demonstrated the earliest, yet very complete, set of fundamental technologies and systems for Chinese spoken language technologies, including TTS (1984-89), natural language grammar and parser (1986-91) and LVCSR (1987-97), taking into account the structural features of the Chinese language (one monosyllable per character, a limited number of distinct monosyllables, tones, etc.) and the extremely limited resources available.
He then focused his work on speech information retrieval, proposing a whole set of approaches that make retrieval performance less dependent on ASR accuracy and improve retrieval efficiency through better user-content interaction. This body of work applies equally to all languages, and was described as the stepping stones towards "a spoken version of Google" when Nature selected him in 2018 as one of the 10 "Science Stars of East Asia" in a special issue on scientific research in East Asia.