InterSpeech 2021

End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain
(3-minute introduction)

Kai Wang (Xinjiang University, China), Hao Huang (Xinjiang University, China), Ying Hu (Xinjiang University, China), Zhihua Huang (Xinjiang University, China), Sheng Li (NICT, Japan)
Traditional single-channel speech separation in the time-frequency (T-F) domain often suffers from the phase reconstruction problem. Because real-valued networks are ill-suited to complex-valued representations, T-F domain separation methods have often been constrained from reaching state-of-the-art performance. In this paper, we propose improved speech separation methods in both the complex and the real T-F domain using orthogonal representations. For the complex-valued case, we combine the deep complex network (DCN) and Conv-TasNet to design an end-to-end complex-valued model. Specifically, we incorporate the short-time Fourier transform (STFT) and learnable complex layers to build a hybrid encoder-decoder structure, and use a DCN-based separator. We then show the importance of weight orthogonality in the T-F transformation and propose a multi-segment orthogonality (MSO) architecture for further improvement. For the real-valued case, we perform separation in the real T-F domain by introducing the short-time discrete cosine transform (STDCT), which likewise provides an orthogonal representation. Experimental results show that the proposed complex-valued model outperforms the baseline Conv-TasNet of comparable parameter size by 1.8 dB, and the STDCT-based real-valued model outperforms it by 1.2 dB, demonstrating the advantages of speech separation in the T-F domain.
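The orthogonality property shared by both models is easy to illustrate in isolation: an orthonormal STDCT analysis matrix B satisfies B Bᵀ = I, and a learnable encoder can be pushed toward the same property with a penalty on its weights. The NumPy sketch below is illustrative only; the function names are hypothetical, and the Frobenius penalty ‖W Wᵀ − I‖²_F is a common choice for encouraging weight orthogonality, assumed here rather than taken from the paper's exact MSO formulation.

    import numpy as np

    def stdct_basis(n: int) -> np.ndarray:
        """Orthonormal DCT-II analysis matrix: rows are cosine atoms, B @ B.T ~ I."""
        k = np.arange(n)[:, None]          # frequency index (rows)
        t = np.arange(n)[None, :]          # time index (columns)
        basis = np.cos(np.pi * k * (2 * t + 1) / (2 * n))
        basis[0] *= np.sqrt(1.0 / n)       # DC row scaling for orthonormality
        basis[1:] *= np.sqrt(2.0 / n)
        return basis

    def orthogonality_penalty(w: np.ndarray) -> float:
        """Squared Frobenius deviation of W @ W.T from the identity (0 iff rows are orthonormal)."""
        gram = w @ w.T
        return float(np.sum((gram - np.eye(w.shape[0])) ** 2))

    B = stdct_basis(64)                     # 64-point STDCT analysis matrix
    print(orthogonality_penalty(B))         # ~0: the STDCT basis is orthogonal by construction
    W = np.random.randn(64, 64) / 8.0       # a random learnable encoder, for contrast
    print(orthogonality_penalty(W))         # large: far from orthogonal

In training, such a penalty could be added to the separation loss so that learnable analysis layers stay close to an orthogonal transform; the STDCT itself needs no penalty, since it is orthogonal by construction.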