|Wupeng Wang (NUS, Singapore), Chenglin Xu (NUS, Singapore), Meng Ge (NUS, Singapore), Haizhou Li (NUS, Singapore)|
In this paper, we propose a novel time-domain speaker-speech cross-attention network, a variant of the SpEx architecture that features speaker-speech cross-attention. The network consists of speech semantic layers, which capture high-level dependencies among audio features, and cross-attention layers, which fuse the speaker embedding with the speech features to estimate the speaker mask. We implement the cross-attention layers with both parallel and sequential concatenation techniques. Experiments show that the proposed models consistently outperform the state-of-the-art time-domain speaker extraction baseline on the WSJ0-2mix dataset.
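To make the fusion idea concrete, below is a minimal NumPy sketch of speaker-speech cross-attention with parallel concatenation: the target-speaker embedding is broadcast and concatenated to every speech frame, the conditioned frames serve as keys and values while the raw frames serve as queries, and a sigmoid of the attended output gives a frame-level mask. All names (`cross_attention_mask`, `Wq`, `Wk`, `Wv`) and the single-head, projection-matrix setup are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_mask(speech, spk_emb, Wq, Wk, Wv):
    """Sketch of one speaker-speech cross-attention layer (hypothetical).

    speech : (T, d) frame-level speech features
    spk_emb: (d,)   target-speaker embedding
    Returns a (T, d) mask with entries in (0, 1).
    """
    T, d = speech.shape
    # parallel concatenation: append the speaker embedding to every frame
    cond = np.concatenate([speech, np.tile(spk_emb, (T, 1))], axis=1)  # (T, 2d)
    q = speech @ Wq                                 # queries from speech frames
    k = cond @ Wk                                   # keys from conditioned frames
    v = cond @ Wv                                   # values from conditioned frames
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (T, T) attention weights
    fused = attn @ v                                # (T, d) fused representation
    return 1.0 / (1.0 + np.exp(-fused))             # sigmoid -> mask in (0, 1)

T, d = 50, 16
speech = rng.standard_normal((T, d))
spk = rng.standard_normal(d)
Wq = rng.standard_normal((d, d)) * 0.1
Wk = rng.standard_normal((2 * d, d)) * 0.1
Wv = rng.standard_normal((2 * d, d)) * 0.1
mask = cross_attention_mask(speech, spk, Wq, Wk, Wv)
extracted = mask * speech  # masked features for the target speaker
```

The sequential variant described in the abstract would instead apply the speaker conditioning and the attention in successive stages rather than in one concatenated input; the sketch above covers only the parallel case.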