Rémi Rigal (Orange Labs, France), Jacques Chodorowski (Orange Labs, France), Benoît Zerr (Lab-STICC (UMR 6285), France)
We present a deep neural network that relies on facial motion and time-domain audio to isolate speech signals from a mixture of multiple speakers and background noise. Recent studies in deep learning-based audio-visual speech separation and speech enhancement have shown that leveraging visual information in addition to audio can substantially improve prediction quality and robustness. We propose to use facial motion, estimated with optical flow techniques, as a visual feature input for our model. We demonstrate that, when combined with state-of-the-art audio-only speech separation approaches, facial motion significantly improves both the speech quality and the versatility of the model. Our proposed method offers a signal-to-distortion ratio (SDR) improvement of up to 4.2 dB on two-speaker mixtures when compared to other audio-visual approaches.
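To make the reported metric concrete, the following is a minimal sketch of how a signal-to-distortion improvement can be computed for one separated speaker. It assumes the plain energy-ratio definition, 10·log10(‖s‖² / ‖s − ŝ‖²); the abstract does not say which variant the authors use (for example BSS Eval SDR or scale-invariant SDR), so the function names `sdr_db` and `sdr_improvement_db` and the toy signals are illustrative assumptions, not the paper's evaluation code.

```python
import math


def sdr_db(reference, estimate):
    """Plain signal-to-distortion ratio in dB (assumed definition):
    10 * log10(||s||^2 / ||s - s_hat||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal is silent")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement_db(reference, mixture, estimate):
    """Gain of the separated estimate over the unprocessed mixture, in dB."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy two-speaker mixture: the target plus an interfering sinusoid.
target = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
interferer = [0.8 * math.sin(2 * math.pi * 13 * n / 100) for n in range(100)]
mixture = [t + i for t, i in zip(target, interferer)]

# Hypothetical separator output: the interferer attenuated by a factor of 10.
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]

print(f"{sdr_improvement_db(target, mixture, estimate):.1f} dB")  # prints "20.0 dB"
```

In this toy case the residual interference amplitude drops by a factor of 10, so its energy drops by a factor of 100, which corresponds to a 20 dB improvement. Real evaluations average this quantity over many mixtures and speakers.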