Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
(3 minutes introduction)

Wupeng Wang (NUS, Singapore), Chenglin Xu (NUS, Singapore), Meng Ge (NUS, Singapore), Haizhou Li (NUS, Singapore)

In this paper, we propose a novel time-domain speaker-speech cross-attention network, a variant of the SpEx [1] architecture. The network consists of speech semantic layers, which capture high-level dependencies among audio features, and cross-attention layers, which fuse the speaker embedding with the speech features to estimate the speaker mask. We implement the cross-attention layers with both parallel and sequential concatenation techniques. Experiments show that the proposed models consistently outperform the state-of-the-art time-domain speaker extraction baseline on the WSJ0-2mix dataset.
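To make the idea of fusing a speaker embedding with speech features more concrete, here is a minimal NumPy sketch of generic scaled dot-product cross-attention followed by a mask estimate. It is an illustration under stated assumptions, not the authors' implementation: the function name `cross_attention_mask`, the single-head layout, the choice to prepend the speaker embedding as an extra key/value token, the sigmoid mask, and all dimensions and random weights are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_mask(speech, speaker, w_q, w_k, w_v, w_m):
    """Hypothetical single-head cross-attention between speech and speaker.

    speech:  (T, D) frame-level speech features
    speaker: (D,)   speaker embedding of the target speaker
    Returns a (T, D) mask in (0, 1) and the (T, T + 1) attention weights.
    """
    # Assumption: the speaker embedding is prepended as one extra token,
    # so every speech frame can attend to it alongside the other frames.
    context = np.vstack([speaker[None, :], speech])   # (T + 1, D)

    q = speech @ w_q                                  # queries from speech
    k = context @ w_k                                 # keys from speaker + speech
    v = context @ w_v                                 # values from speaker + speech

    scores = q @ k.T / np.sqrt(q.shape[-1])           # (T, T + 1)
    weights = softmax(scores, axis=-1)
    fused = weights @ v                               # (T, D)

    # Assumption: a sigmoid projection turns the fused features into a mask.
    mask = 1.0 / (1.0 + np.exp(-(fused @ w_m)))
    return mask, weights


rng = np.random.default_rng(0)
T, D = 6, 8                                           # toy sizes, chosen arbitrarily
speech = rng.standard_normal((T, D))
speaker = rng.standard_normal(D)
w_q, w_k, w_v, w_m = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))

mask, weights = cross_attention_mask(speech, speaker, w_q, w_k, w_v, w_m)
extracted = mask * speech                             # masked speech features
```

In a real extraction model the weights would be learned and the mask would be applied to the encoded mixture before decoding back to a waveform; the sketch only shows how attention lets each speech frame weigh the speaker embedding when the mask is formed.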

InterSpeech 2021

Related Recordings

IMPROVED SPEECH SEPARATION WITH TIME-AND-FREQUENCY CROSS-DOMAIN FEATURE SELECTION
(3 minutes introduction)

Deep audio-visual speech separation based on facial motion
(3 minutes introduction)
