Odyssey 2012

The Speaker and Language Recognition Workshop

On the use of Agglomerative and Spectral Clustering in Speaker Diarization of Meetings

Presented by:
Jordi Luque
Jordi Luque and Javier Hernando

In this paper, we present a clustering algorithm for speaker diarization based on spectral clustering. State-of-the-art diarization systems are based on agglomerative hierarchical clustering (AHC) using the BIC criterion and other statistical metrics among clusters, which results in a high computational cost. Our proposal avoids such metrics by applying Euclidean distances to the eigenvectors computed from the normalized graph Laplacian. A hybrid system is proposed in which HMM/GMM modeling and Viterbi alignment are still applied, but the BIC-based merging and stopping criteria are replaced by a spectral clustering algorithm. Once an initial segmentation is obtained, and after a Viterbi alignment by means of an ergodic HMM, the remaining clusters are modeled by stacking the means of the Gaussians into a supervector. In this space, the singular value decomposition of the associated normalized graph Laplacian is computed. The most similar clusters are merged based on the Euclidean distances in the resulting eigenspace. The number of clusters is estimated by analyzing the eigenstructure of the similarity matrix and selecting a threshold on the eigenvalue gap. In experiments, this approach obtains performance comparable to the traditional AHC+BIC approach on the Rich Transcription conference evaluation data. Although it still relies on Gaussian modeling of clusters and Viterbi alignment, the proposed approach leads to a system that runs several times faster than the traditional one.
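
The spectral steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions the abstract does not specify: the Gaussian-kernel similarity and its width `sigma`, the row normalization of the embedding, the value of `gap_threshold`, and all function names. It also uses a symmetric eigendecomposition (`numpy.linalg.eigh`) in place of the singular value decomposition mentioned above.

```python
import numpy as np

def normalized_laplacian(supervectors, sigma=1.0):
    """Symmetric normalized graph Laplacian, L = I - D^{-1/2} W D^{-1/2}.

    Assumption: W is a Gaussian-kernel similarity between cluster supervectors.
    """
    diff = supervectors[:, None, :] - supervectors[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-similarity
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def estimate_num_clusters(eigvals, gap_threshold):
    """Number of eigenvalues before the first gap larger than the threshold.

    eigvals must be sorted in ascending order.
    """
    gaps = np.diff(eigvals)
    above = np.nonzero(gaps > gap_threshold)[0]
    return int(above[0]) + 1 if above.size else len(eigvals)

def spectral_embedding(supervectors, sigma=1.0, gap_threshold=0.5):
    """Return the estimated cluster count k and a k-dimensional embedding."""
    L = normalized_laplacian(supervectors, sigma)
    eigvals, eigvecs = np.linalg.eigh(L)  # ascending eigenvalues
    k = estimate_num_clusters(eigvals, gap_threshold)
    emb = eigvecs[:, :k]
    # Assumption: rows are scaled to unit length before measuring distances.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return k, emb / np.maximum(norms, 1e-12)

def closest_pair(embedding):
    """Indices of the two clusters closest in Euclidean distance (merge candidates)."""
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # ignore zero self-distances
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return int(i), int(j)
```

In this sketch, `supervectors` would hold one row per remaining cluster (its stacked Gaussian means). Calling `spectral_embedding` gives the estimated number of speakers, and repeatedly merging the pair returned by `closest_pair` until that number of clusters remains would stand in for the BIC-based merging and stopping decisions.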