InterSpeech 2021

Multimodal systems

Direct multimodal few-shot learning of speech and images
(3 minutes introduction)

Leanne Nortje (Stellenbosch University, South Africa), Herman Kamper (Stellenbosch University, South Africa)

Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval
(3 minutes introduction)

Ramon Sanabria (University of Edinburgh, UK), Austin Waters (Google, USA), Jason Baldridge (Google, USA)

Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition
(3 minutes introduction)

Jianrong Wang (Tianjin University, China), Ziyue Tang (Tianjin University, China), Xuewei Li (Tianjin University, China), Mei Yu (Tianjin University, China), Qiang Fang (CASS, China), Li Liu (CUHK, China)

Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition
(longer introduction)

Jianrong Wang (Tianjin University, China), Ziyue Tang (Tianjin University, China), Xuewei Li (Tianjin University, China), Mei Yu (Tianjin University, China), Qiang Fang (CASS, China), Li Liu (CUHK, China)

Attention-Based Keyword Localisation in Speech using Visual Grounding
(3 minutes introduction)

Kayode Olaleye (Stellenbosch University, South Africa), Herman Kamper (Stellenbosch University, South Africa)

Automatic Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries
(3 minutes introduction)

Hang Chen (USTC, China), Jun Du (USTC, China), Yu Hu (USTC, China), Li-Rong Dai (USTC, China), Bao-Cai Yin (iFLYTEK, China), Chin-Hui Lee (Georgia Tech, USA)

Automatic Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries
(longer introduction)

Hang Chen (USTC, China), Jun Du (USTC, China), Yu Hu (USTC, China), Li-Rong Dai (USTC, China), Bao-Cai Yin (iFLYTEK, China), Chin-Hui Lee (Georgia Tech, USA)

LiRA: Learning Visual Speech Representations from Audio through Self-supervision
(3 minutes introduction)

Pingchuan Ma (Imperial College London, UK), Rodrigo Mira (Imperial College London, UK), Stavros Petridis (Facebook, UK), Björn W. Schuller (Imperial College London, UK), Maja Pantic (Imperial College London, UK)

End-to-end audio-visual speech recognition for overlapping speech
(3 minutes introduction)

Richard Rose (Google, USA), Olivier Siohan (Google, USA), Anshuman Tripathi (Google, USA), Otavio Braga (Google, USA)

Audio-Visual Multi-Talker Speech Recognition in A Cocktail Party
(3 minutes introduction)

Yifei Wu (SJTU, China), Chenda Li (SJTU, China), Song Yang (TAL, China), Zhongqin Wu (TAL, China), Yanmin Qian (SJTU, China)