Odyssey 2020

The Speaker and Language Recognition Workshop

On Early-stop Clustering for Speaker Diarization

Liping Chen, Kongaik Lee, Lei He, Frank Soong
We propose an early-stop strategy to improve the performance of speaker diarization systems based on agglomerative hierarchical clustering (AHC). The proposed strategy stops clustering early, so that it generates more clusters than the given number of speakers. Starting from these initial clusters, an exhaustive search finds the best combination of clusters that matches the number of speakers. We show that the final clusters are more homogeneous with respect to their corresponding speakers, i.e., they contain fewer speech frames from interfering speakers. When the number of speakers is unknown, we first estimate it from the speaker similarity score matrix computed across all initial clusters. Our experiments on DIHARD show that the proposed early-stop clustering, combined with speaker cluster selection, leads to better speaker cluster purity and better diarization performance than conventional AHC. Moreover, when the number of speakers was not given, the proposed techniques for estimating the number of speakers and selecting the clusters corresponding to the speakers kept system performance stable with regard to different stop thresholds.
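As a rough illustration of the two stages described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a precomputed segment-level similarity matrix, average-linkage merging, and a hypothetical selection criterion (mean within-group similarity minus mean between-group similarity); the abstract does not specify the linkage or the search criterion, and the function names and toy data are invented for the example. The case of an unknown number of speakers is not covered.

```python
from itertools import combinations


def avg_sim(sim, a, b):
    """Mean pairwise similarity between two lists of segment indices."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def early_stop_ahc(sim, n_initial):
    """Average-linkage AHC, stopped early once n_initial clusters remain."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_initial:
        # Find and merge the most similar pair of clusters (p < q).
        p, q = max(
            combinations(range(len(clusters)), 2),
            key=lambda pq: avg_sim(sim, clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def partitions(items, k):
    """Yield every split of items into exactly k non-empty groups."""
    if not items:
        if k == 0:
            yield []
        return
    if k < 1 or k > len(items):
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest, k):  # first joins an existing group
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
    for part in partitions(rest, k - 1):  # first starts a new group
        yield [[first]] + part


def score(sim, groups):
    """Assumed criterion: within-group minus between-group similarity."""
    within = sum(avg_sim(sim, g, g) for g in groups) / len(groups)
    pairs = list(combinations(groups, 2))
    between = sum(avg_sim(sim, g, h) for g, h in pairs) / len(pairs) if pairs else 0.0
    return within - between


def diarize(sim, n_speakers, n_initial):
    """Early-stop AHC followed by an exhaustive search over cluster combinations."""
    initial = early_stop_ahc(sim, n_initial)
    candidates = (
        [sorted(sum(group, [])) for group in part]  # flatten clusters into segments
        for part in partitions(initial, n_speakers)
    )
    return sorted(max(candidates, key=lambda groups: score(sim, groups)))


# Toy data (invented): six segments, the first three from one speaker.
speakers = [0, 0, 0, 1, 1, 1]
sim = [
    [1.0 if i == j else (0.9 if speakers[i] == speakers[j] else 0.1) for j in range(6)]
    for i in range(6)
]
result = diarize(sim, n_speakers=2, n_initial=4)
print(result)
```

The exhaustive search enumerates every partition of the initial clusters, so its cost grows quickly with the number of initial clusters; the sketch is only practical when the early stop leaves a small number of them.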