Learning speech models from multi-modal data
Karen Livescu (TTI-Chicago)
Abstract

Speech is usually recorded as an acoustic signal, but it often appears in context with other signals. In addition to the acoustic signal, we may have available a corresponding visual scene, the video of the speaker, physiological signals such as the speaker's movements or neural recordings, or other related signals. It is often possible to learn a better speech model or representation by considering the context provided by these additional signals, or to learn with less training data. Typical approaches to training from multi-modal data are based on the idea that models or representations of each modality should be in some sense predictive of the other modalities. Multi-modal approaches can also take advantage of the fact that the sources of noise or nuisance variables are different in different measurement modalities, so an additional (non-acoustic) modality can help learn a speech representation that suppresses such noise. This talk will survey several lines of work in this area, both older and newer. It will cover some basic techniques from machine learning and statistics, as well as specific models and applications for speech.

Bio

Karen Livescu is an Associate Professor at TTI-Chicago. She completed her PhD in electrical engineering and computer science at MIT. Her main research interests are in speech and language processing, as well as related problems in machine learning. Some specific interests include multi-view representation learning, visually grounded speech models, acoustic word embeddings, new models for speech recognition and understanding, unsupervised and weakly supervised models for speech and text, and sign language recognition from video. Her professional activities include serving as a program chair of ICLR 2019, ASRU 2015/2017/2019, and Interspeech 2022, and on the editorial boards of IEEE OJ-SP and IEEE TPAMI. She is an ISCA Fellow and an IEEE SPS Distinguished Lecturer.