Odyssey 2020

The Speaker and Language Recognition Workshop

Towards Unsupervised Learning of Speech Representations

Dr. Mirco Ravanelli
Université de Montréal, Canada
The success of deep learning techniques strongly depends on the quality of the representations that are automatically discovered from data. These representations should capture intermediate concepts, features, or latent variables, and are commonly learned in a supervised way using large annotated corpora. Even though this is still the dominant paradigm, some crucial limitations arise. Collecting large amounts of annotated examples, for instance, is very costly and time-consuming. Moreover, supervised representations are likely to be biased toward the considered problem, possibly limiting their exportability to other problems and applications. A natural way to mitigate these issues is unsupervised learning. Unsupervised learning attempts to extract knowledge from unlabeled data, and can potentially discover representations that capture the underlying structure of such data. This modality, sometimes referred to as self-supervised learning, is gaining popularity within the computer vision community, while its application on high-dimensional and long temporal sequences like speech still remains challenging. In this keynote, I will summarize some recent efforts to learn general, robust, and transferrable speech representations using unsupervised/self-supervised approaches. In particular, I will focus on a novel technique called Local Info Max (LIM),that learns speech representations using a maximum mutual information approach. I will then introduce the recently-proposed problem-agnostic speech encoder (PASE) that is derived by jointly solving multiple self-supervised tasks. PASE is a first step towards a universal neural speech encoder and turned out to be useful for a large variety of applications such as speech recognition, speaker identification, and emotion recognition. Mirco Ravanelli is currently a post-doc researcher at Mila (Université de Montréal) working under the supervision of Prof. Yoshua Bengio. His main research interests are deep learning, speech recognition, far-field speech recognition, robust acoustic scene analysis, cooperative learning, and unsupervised learning. He is the author or co-author of more than 40 papers on these research topics. He received his Ph.D. (with cum laude distinction) from the University of Trento in December 2017. During his Ph.D., he focused on deep learning for distant speech recognition, with a particular emphasis on noise-robust deep neural architectures.