Improving Robustness of Speaker Verification Against Mimicked Speech

Kuruvachan K George, Santhosh Kumar C, Ramachandran K I, Ashish Panda

Making speaker verification (SV) systems robust to spoofed or mimicked speech attacks is essential for their effective use in security applications. In this work, we show that a proximal support vector machine backend classifier with i-vectors as inputs (i-PSVM) can improve the performance of SV systems when mimicked speech is used as non-target trials. We compare our results with three baseline systems: the state-of-the-art i-vector system with cosine distance scoring (i-CDS), i-vectors with a backend SVM classifier (i-SVM), and cosine distance features with an SVM backend classifier (CDF-SVM). In all experiments with an SVM backend classifier, we oversampled the target utterance feature vectors before i-vector extraction using utterance partitioning followed by acoustic vector resampling (UP-AVR). UP-AVR mitigates the data imbalance problem that arises because the development data provide far more non-target examples than target examples for training the models. In i-PSVM, the proximity of the test utterance to the target and non-target classes is the criterion for decision making, whereas in i-SVM the criterion is the distance from the separating hyperplane. The i-PSVM approach proved advantageous when tested with mimicked speech as non-target trials, which suggests that proximity to the target speaker is a better criterion for speaker verification against mimicked speech. Further, we note that weighting the target and non-target class examples helps fine-tune the performance of i-PSVM. We therefore devised a strategy for estimating a weight for every example based on its cosine similarity to the centroid of the target class examples. The final i-PSVM with the example-based weighting scheme achieved an absolute improvement of 3.39% in equal error rate (EER) over the best baseline system, i-SVM. Finally, we fused the i-PSVM and i-SVM systems, and the results show that the combined system performs better than either individual system.
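The idea of a proximal SVM with per-example weights can be illustrated with a minimal sketch. It assumes the standard linear PSVM formulation (a regularised least-squares fit to the labels +1 for target and -1 for non-target, solved in closed form) and extends it with a diagonal weight matrix. The function names, the parameter `nu`, and in particular the mapping from cosine similarity to a weight are assumptions made for illustration; the abstract does not specify the authors' exact weighting formula or training setup.

```python
import numpy as np

def example_weights(X, target_centroid, floor=0.05):
    """Weight each example by its cosine similarity to the target centroid.

    Assumption: similarity in [-1, 1] is mapped linearly to [0, 1] and
    clipped at a small floor. This mapping is hypothetical, not the paper's.
    """
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(target_centroid)
    cos = (X @ target_centroid) / norms
    return np.clip((1.0 + cos) / 2.0, floor, 1.0)

def train_weighted_psvm(X, y, weights, nu=1.0):
    """Closed-form linear PSVM with per-example weights.

    Minimises 0.5 * (||w||^2 + gamma^2)
              + 0.5 * nu * sum_i c_i * (x_i . w - gamma - y_i)^2
    where y_i is +1 (target) or -1 (non-target) and c_i is the weight.
    """
    E = np.hstack([X, -np.ones((X.shape[0], 1))])   # rows are [x_i, -1]
    EtC = E.T * weights                              # E^T C with C = diag(c)
    z = np.linalg.solve(np.eye(E.shape[1]) / nu + EtC @ E, EtC @ y)
    return z[:-1], z[-1]                             # w, gamma

def psvm_score(x, w, gamma):
    """Positive scores lie nearer the target plane x.w - gamma = +1."""
    return float(x @ w - gamma)
```

With this mapping, non-target examples that lie close to the target centroid (as a mimic's i-vector might) receive larger weights than distant ones; whether that matches the weighting used in the talk is not stated in the abstract.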

Odyssey 2016

The Speaker and Language Recognition Workshop

Related Recordings

Feature-based likelihood ratios for speaker recognition from linguistically-constrained formant-based i-vectors

Multi-channel i-vector combination for robust speaker verification in multi-room domestic environments