InterSpeech 2021

Neural target speech extraction

Kateřina Žmolíková, Marc Delcroix
Dealing with overlapping speech remains one of the great challenges of speech processing. Target speech extraction consists of directly estimating the speech of a desired speaker in a speech mixture, given clues about that speaker, such as a short enrollment utterance or a video of the speaker. It is an emerging field of research that has gained increasing attention because it provides a practical alternative to blind source separation for processing overlapping speech. Indeed, by focusing on extracting only one speaker, target speech extraction can relax some of the limitations of blind source separation, such as the need to know the number of speakers and the speaker permutation ambiguity. In this tutorial, we will present an in-depth review of neural target speech extraction, including audio, visual, and multi-channel approaches, covering the basic concepts up to the most recent developments in the field. We will provide a unified presentation of the different approaches to emphasize their similarities and differences. We will also discuss extensions to other tasks, such as speech recognition, voice activity detection, and diarization.

Marc Delcroix

Marc Delcroix received the M.Eng. degree from the Free University of Brussels, Brussels, Belgium, and the École Centrale Paris, Paris, France, in 2003, and the Ph.D. degree from Hokkaido University, Sapporo, Japan, in 2007. He was a Research Associate with NTT Communication Science Laboratories (CS labs), Kyoto, Japan, from 2007 to 2008 and from 2010 to 2012, where he became a Permanent Research Scientist in 2012. He was a Visiting Lecturer with Waseda University, Tokyo, Japan, from 2015 to 2018. He is currently a Distinguished Researcher with CS labs. His research interests cover various aspects of speech signal processing, including robust speech recognition, speech enhancement, target speech extraction, and model adaptation.
Together with Kateřina Žmolíková, he pioneered the field of neural network-based target speech extraction and has been actively pursuing research in that direction, publishing early works on target-speaker ASR and audio-visual target speech extraction, and presenting a show-and-tell on the topic at ICASSP'19. Dr. Delcroix is a member of the IEEE Signal Processing Society Speech and Language Processing Technical Committee (SLTC). He was one of the organizers of the REVERB Challenge 2014 and the ASRU 2017 workshop. He was also a senior affiliate at the Jelinek workshop on speech and language technology (JSALT) in 2015 and 2020.

Kateřina Žmolíková

Kateřina Žmolíková received the B.Sc. degree in information technology in 2014 and the Ing. degree in mathematical methods in information technology in 2016 from the Faculty of Information Technology, Brno University of Technology (BUT), Czech Republic, where she is currently working towards the Ph.D. degree. Since 2013, she has been part of the Speech@FIT research group at BUT. She completed internships at the Toshiba Research Laboratory in Cambridge in 2014 and in the Signal Processing Research Group at NTT in Kyoto in 2017. She also took part in the Jelinek workshop on speech and language technology (JSALT) in 2015 and 2020.