Brecht Desplanques, Kris Demuynck, Jean-Pierre Martens
In this work we propose integrating a soft voice activity detection (VAD) module into an iVector-based speaker segmentation system. Because speaker change detection should be based on speaker information only, we make it disregard non-speech frames by applying speech posteriors during the estimation of the Baum-Welch statistics. The speaker segmentation relies on speaker factors that are extracted on a frame-by-frame basis using an eigenvoice matrix. Speaker boundaries are inserted at positions where the distance between the speaker factors on both sides is large. A Mahalanobis distance appears capable of suppressing the effects of differences in phonetic content between the two sides, and therefore of generating more accurate speaker boundaries. This iVector-based segmentation significantly outperforms Bayesian Information Criterion (BIC) segmentation methods and can be made adaptive on a file-by-file basis in a two-pass approach. Experiments on the COST278 multilingual broadcast news database show that integrating the soft VAD significantly reduces the boundary detection error rate. Furthermore, the more accurate boundaries induce a slight improvement of the iVector Probabilistic Linear Discriminant Analysis (PLDA) system that is employed for speaker clustering.
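The two central operations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, array shapes, and the assumption that a covariance matrix for the Mahalanobis distance is already available are all hypothetical and chosen only for illustration. The first function weights each frame's component posteriors by its speech posterior before accumulating zeroth- and first-order Baum-Welch statistics, so that frames judged to be non-speech contribute little or nothing. The second computes a Mahalanobis distance between the speaker factors on the two sides of a candidate boundary.

```python
import numpy as np


def soft_vad_baum_welch_stats(features, component_posteriors, speech_posteriors):
    """Accumulate Baum-Welch statistics with soft VAD weighting.

    features:             (T, D) acoustic feature vectors, one row per frame
    component_posteriors: (T, C) per-frame posteriors of the C mixture components
    speech_posteriors:    (T,)   per-frame probability of speech, in [0, 1]

    All names and shapes are illustrative assumptions, not taken from the paper.
    """
    # Scale each frame's component posteriors by its speech posterior.
    weighted = component_posteriors * speech_posteriors[:, None]  # (T, C)
    zeroth_order = weighted.sum(axis=0)                           # (C,)
    first_order = weighted.T @ features                           # (C, D)
    return zeroth_order, first_order


def mahalanobis_distance(left_factors, right_factors, covariance):
    """Mahalanobis distance between two speaker factor vectors.

    covariance is assumed to be a symmetric positive definite matrix supplied
    by the caller; how it is estimated is outside the scope of this sketch.
    """
    diff = left_factors - right_factors
    # Solve covariance @ x = diff instead of forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(covariance, diff)))
```

In a segmentation pass along these lines, a boundary would be hypothesised where this distance, evaluated between the speaker factors left and right of a frame, is large. Setting a frame's speech posterior to zero removes it from the statistics entirely, while intermediate values down-weight it.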