InterSpeech 2021

Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings

Shiliang Zhang (Alibaba, China), Siqi Zheng (Alibaba, China), Weilong Huang (Alibaba, China), Ming Lei (Alibaba, China), Hongbin Suo (Alibaba, China), Jinwei Feng (Alibaba, USA), Zhijie Yan (Alibaba, China)
In this paper, we propose an overlapping speech detection (OSD) system for real multiparty meetings. Different from previous work on single-channel recordings or simulated data, we conduct our research on real multi-channel data recorded by an 8-microphone array. We investigate how the spatial information provided by multi-channel beamforming can benefit OSD. Specifically, we propose a two-stream DFSMN to jointly model acoustic and spatial features. Instead of performing frame-level OSD, we perform segment-level OSD and introduce an attention pooling layer to model speech segments of variable length. Experimental results show that the two-stream DFSMN with attention pooling can effectively model acoustic-spatial features and significantly boost OSD performance, yielding a 3.5% absolute improvement in detection accuracy (from 85.57% to 89.12%) over the baseline system.
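As a rough illustration of the segment-level pipeline described above, the sketch below wires a two-stream encoder (one branch for acoustic features, one for spatial features) into an attention pooling layer that maps a variable-length segment to a fixed-size vector before overlap/non-overlap classification. This is a minimal PyTorch sketch under stated assumptions, not the authors' model: the class names (AttentionPooling, TwoStreamOSD), the plain feed-forward blocks standing in for DFSMN layers, and all feature and layer dimensions are hypothetical, since the abstract does not specify them.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Weights each frame with a learned score and sums, so segments of
    different lengths map to a single fixed-size vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x, mask=None):
        # x: (batch, frames, dim); mask: (batch, frames), 1 for valid frames
        logits = self.score(x).squeeze(-1)                    # (batch, frames)
        if mask is not None:
            logits = logits.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(logits, dim=-1)               # frame-level attention
        return torch.sum(weights.unsqueeze(-1) * x, dim=1)    # (batch, dim)


class TwoStreamOSD(nn.Module):
    """Hypothetical two-stream segment-level OSD classifier: acoustic and
    spatial streams are encoded separately, concatenated per frame,
    attention-pooled, then classified as overlap / non-overlap."""
    def __init__(self, acoustic_dim=80, spatial_dim=40, hidden=256, classes=2):
        super().__init__()
        self.acoustic_enc = nn.Sequential(nn.Linear(acoustic_dim, hidden), nn.ReLU())
        self.spatial_enc = nn.Sequential(nn.Linear(spatial_dim, hidden), nn.ReLU())
        self.pool = AttentionPooling(2 * hidden)
        self.classifier = nn.Linear(2 * hidden, classes)

    def forward(self, acoustic, spatial, mask=None):
        # acoustic: (batch, frames, acoustic_dim); spatial: (batch, frames, spatial_dim)
        fused = torch.cat([self.acoustic_enc(acoustic),
                           self.spatial_enc(spatial)], dim=-1)
        return self.classifier(self.pool(fused, mask))        # (batch, classes)


# Example: a batch of two segments padded to 120 frames; the second is 90 frames long.
model = TwoStreamOSD()
acoustic = torch.randn(2, 120, 80)
spatial = torch.randn(2, 120, 40)
mask = torch.ones(2, 120)
mask[1, 90:] = 0
print(model(acoustic, spatial, mask).shape)  # torch.Size([2, 2])
```

The masking in the pooling layer is what lets one batch mix segments of different lengths: padded frames receive zero attention weight, so the pooled representation depends only on the valid frames of each segment.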