InterSpeech 2021

Automatic Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

Hang Chen (USTC, China), Jun Du (USTC, China), Yu Hu (USTC, China), Li-Rong Dai (USTC, China), Bao-Cai Yin (iFLYTEK, China), Chin-Hui Lee (Georgia Tech, USA)
In this paper, we propose a novel deep learning architecture for improving word-level lip-reading. We first incorporate multi-scale processing into spatial feature extraction for lip-reading using hierarchical pyramidal convolution (HPConv) and self-attention. Specifically, HPConv is proposed to replace conventional convolution in feature extraction, improving the model’s ability to discover fine-grained lip movements. Next, to deal with fixed-length image sequences representing words in a given database, a self-attention mechanism is proposed to integrate local information across all lip frames without assuming known word boundaries, so that our deep models automatically utilize key features in the relevant frames of a given word. Experiments on the Lip Reading in the Wild corpus show that our proposed architecture achieves an accuracy of 86.83%, yielding a relative error rate reduction of about 10% over a state-of-the-art scheme that averages frame scores for information fusion. A detailed analysis of the experimental results also confirms that the weights learned by self-attention tend to be zero at both ends of an image sequence, while the non-zero weights are concentrated in the middle part, where the given word is located.
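
The contrast between averaging frame scores and attention-weighted fusion can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the per-frame relevance logits, and the score values below are hypothetical, and the fusion shown is a simplified softmax pooling rather than the self-attention block used in the paper. In a trained model the relevance logits would be learned, not set by hand.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_fusion(frame_scores):
    """Baseline: give every frame the same weight 1/T."""
    num_frames = len(frame_scores)
    num_classes = len(frame_scores[0])
    return [
        sum(frame[c] for frame in frame_scores) / num_frames
        for c in range(num_classes)
    ]


def attention_fusion(frame_scores, relevance_logits):
    """Weight each frame by a softmax-normalized relevance, then sum."""
    weights = softmax(relevance_logits)
    num_classes = len(frame_scores[0])
    fused = [
        sum(w * frame[c] for w, frame in zip(weights, frame_scores))
        for c in range(num_classes)
    ]
    return fused, weights


# Hypothetical 7-frame clip, 2 word classes. The outer frames show
# neighbouring words and vote for class 1; the middle frames show the
# target word and vote for class 0.
frame_scores = [
    [0.1, 0.9], [0.1, 0.9],
    [0.8, 0.2], [0.9, 0.1], [0.8, 0.2],
    [0.1, 0.9], [0.1, 0.9],
]
# Hypothetical relevance logits: low at both ends, high in the middle.
relevance_logits = [-4.0, -4.0, 2.0, 3.0, 2.0, -4.0, -4.0]

avg_scores = average_fusion(frame_scores)
att_scores, att_weights = attention_fusion(frame_scores, relevance_logits)

print("average fusion:  ", avg_scores)
print("attention fusion:", att_scores)
print("frame weights:   ", att_weights)
```

In this toy setting the boundary frames dominate the plain average, whereas the attention weights are close to zero at both ends of the clip, so the fused score follows the middle frames. This mirrors the weight pattern described in the abstract, without claiming anything about the magnitude of the reported gains.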