|Osamu Ichikawa (Shiga University, Japan), Kaito Nakano (Shiga University, Japan), Takahiro Nakayama (University of Tokyo, Japan), Hajime Shirouzu (NIER, Japan)|
Attempts are being made to visualize the learning process by attaching microphones to students participating in group work conducted in classrooms and subsequently recognizing their speech using an automatic speech recognition (ASR) system. However, the voices of nearby students frequently become mixed into the output speech data, even when noise-robust close-talk microphones are used. To resolve this challenge, in this paper, we propose using multi-channel voice activity detection (VAD) to determine the speech segments of a target speaker while also referencing the output speech from the microphones attached to the other speakers in the group. Evaluation experiments using the actual speech of middle school students during group work lessons showed that our proposed method significantly reduces the frame error rate to 38.7%, compared with 49.5% for the conventional technology, single-channel VAD. In our view, conventional approaches, such as distributed microphone arrays and deep learning, are somewhat dependent on the temporal stationarity of the speakers’ positions. However, the proposed method is essentially a VAD process and thus works robustly. It is a practical and proven solution in a real classroom environment.
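The abstract does not specify the detection rule, so the following is only a minimal illustrative sketch of the general idea, not the authors' algorithm. It assumes a simple energy-based decision: a frame counts as the target speaker's speech only when the target microphone is active and louder than every other microphone in the group by a margin. The function names, the thresholds (`threshold_db`, `margin_db`), and the toy signals are all hypothetical choices made for illustration.

```python
import math


def frame_energies_db(samples, frame_len):
    """Log energy (dB) of each non-overlapping frame of one channel."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # avoid log(0)
    return energies


def single_channel_vad(energies_db, threshold_db=-40.0):
    """Baseline: a frame is speech if its own energy exceeds a threshold."""
    return [e > threshold_db for e in energies_db]


def multi_channel_vad(all_energies_db, target, threshold_db=-40.0, margin_db=3.0):
    """Assumed rule: the target frame is speech only if the target channel is
    active AND at least margin_db louder than every other channel."""
    decisions = []
    for t in range(len(all_energies_db[target])):
        own = all_energies_db[target][t]
        others = [ch[t] for i, ch in enumerate(all_energies_db) if i != target]
        loudest_other = max(others) if others else -math.inf
        decisions.append(own > threshold_db and own >= loudest_other + margin_db)
    return decisions


def frame_error_rate(predicted, reference):
    """Fraction of frames whose speech/non-speech label is wrong."""
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


# Toy data (hypothetical): speaker A talks in frames 0-1, speaker B in
# frames 2-3, and frame 4 is silence. Each close-talk microphone also
# picks up the other speaker at a lower level (crosstalk).
FRAME_LEN = 4
mic_a = [0.5] * 8 + [0.1] * 8 + [0.0] * 4
mic_b = [0.1] * 8 + [0.5] * 8 + [0.0] * 4
energies = [frame_energies_db(mic_a, FRAME_LEN), frame_energies_db(mic_b, FRAME_LEN)]
reference_a = [True, True, False, False, False]

single = single_channel_vad(energies[0])
multi = multi_channel_vad(energies, target=0)
print("single-channel FER:", frame_error_rate(single, reference_a))
print("multi-channel FER:", frame_error_rate(multi, reference_a))
```

On this toy input the single-channel baseline mislabels speaker B's crosstalk as speaker A's speech, while the cross-channel comparison rejects it. These toy figures say nothing about the error rates reported in the abstract, which come from real classroom recordings.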