Deep audio-visual speech separation based on facial motion (3 minutes introduction)

Rémi Rigal (Orange Labs, France), Jacques Chodorowski (Orange Labs, France), Benoît Zerr (Lab-STICC (UMR 6285), France)

We present a deep neural network that relies on facial motion and time-domain audio to isolate speech signals from a mixture of speech and background noise. Recent studies in deep learning-based audio-visual speech separation and speech enhancement have shown that leveraging visual information in addition to audio can substantially improve prediction quality and robustness. We propose to use facial motion, inferred with optical flow techniques, as the visual feature input for our model. We demonstrate that, when combined with state-of-the-art audio-only speech separation approaches, facial motion significantly improves speech quality as well as the versatility of the model. Our proposed method offers a signal-to-distortion ratio improvement of up to 4.2 dB on two-speaker mixtures when compared to other audio-visual approaches.
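
To make the reported metric concrete, the following is a minimal sketch of how a signal-to-distortion ratio (SDR) improvement can be computed for one speaker. It assumes the plain definition SDR = 10 log10(||s||^2 / ||s - s_hat||^2); the abstract does not state which SDR variant or evaluation toolkit the authors used, so the function names and the toy signals here are purely illustrative.

```python
import math


def sdr_db(reference, estimate):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: this is the textbook definition, not necessarily the
    variant used in the paper. No guard is included for a zero error.
    """
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement_db(reference, mixture, estimate):
    """SDR gained by the separated estimate over the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy two-speaker example (hypothetical signals, not data from the paper).
num_samples = 100
target = [math.sin(2 * math.pi * 5 * n / num_samples) for n in range(num_samples)]
interferer = [math.sin(2 * math.pi * 13 * n / num_samples) for n in range(num_samples)]

# The mixture contains the full interferer; the "separated" estimate
# keeps only 10% of the interferer's amplitude.
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]

improvement = sdr_improvement_db(target, mixture, estimate)
print(f"SDR improvement: {improvement:.1f} dB")
```

In this toy case the residual interference is attenuated by a factor of 10 in amplitude, which corresponds to an improvement of 20 dB. The "up to 4.2 dB" in the abstract is a different comparison: the gain of the proposed method over other audio-visual approaches.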

InterSpeech 2021

Related Recordings

IMPROVED SPEECH SEPARATION WITH TIME-AND-FREQUENCY CROSS-DOMAIN FEATURE SELECTION
(3 minutes introduction)

Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
(3 minutes introduction)
