Qingjian Lin, Tingle Li, Lin Yang, Junjie Wang, Ming Li
Neural network approaches are becoming increasingly popular for the submodules of speaker diarization, such as voice activity detection, speaker embedding extraction and clustering. Still, end-to-end training of speaker diarization remains challenging, partly because it is hard to design a loss that handles the speaker-label ambiguity problem. Permutation-invariant training (PIT) loss is a possible solution, but its time complexity exceeds O(N!), where N is the number of speakers in the audio. In this paper, we improve the PIT loss and further propose a novel optimal mapping loss that directly computes the best matches between output speakers and target speakers. The proposed loss is based on the Hungarian algorithm and reduces the time complexity to about O(N^3) for large N, while keeping the same performance as the PIT loss.
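
The contrast between the two losses can be illustrated with a minimal sketch. It is not the authors' implementation: the frame-wise binary cross-entropy cost, the function names (`pairwise_bce`, `pit_loss`, `optimal_mapping_loss`) and the use of NumPy with SciPy's `linear_sum_assignment` as the assignment solver are all assumptions made for illustration. The brute-force version enumerates every speaker permutation, while the assignment-based version solves the same matching problem on an N x N cost matrix and should reach the same minimum.

```python
from itertools import permutations

import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_bce(outputs, targets, eps=1e-7):
    """Return an (N, N) matrix whose entry [i, j] is the mean binary
    cross-entropy between output speaker i and target speaker j.

    outputs: (T, N) array of per-frame speaker probabilities (assumed format).
    targets: (T, N) array of 0/1 per-frame speaker activity labels.
    """
    p = np.clip(outputs, eps, 1.0 - eps)
    num_frames = outputs.shape[0]
    # Summing over the time axis with matrix products gives all N*N pairs at once.
    cost = -(np.log(p).T @ targets + np.log(1.0 - p).T @ (1.0 - targets))
    return cost / num_frames


def pit_loss(cost):
    """Brute-force PIT: try every permutation of target speakers (N! of them)."""
    n = cost.shape[0]
    return min(
        sum(cost[i, perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )


def optimal_mapping_loss(cost):
    """Assignment-based matching: solve the linear assignment problem directly.

    Returns the minimal total cost and, for each output speaker i,
    the index of the target speaker it is matched to.
    """
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum(), cols


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    targets = rng.integers(0, 2, size=(50, 4)).astype(float)
    outputs = rng.uniform(0.01, 0.99, size=(50, 4))
    cost = pairwise_bce(outputs, targets)
    print("brute-force PIT loss:", pit_loss(cost))
    print("assignment-based loss:", optimal_mapping_loss(cost)[0])
```

In an actual training loop, the matching would typically be computed on detached cost values and the resulting permutation used to select which output-target pairs contribute to a differentiable loss; that step is omitted here.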