InterSpeech 2021

A comparison of the accuracy of Dissen and Keshet's (2016) DeepFormants and traditional LPC methods for semi-automatic speaker recognition
(3-minute introduction)

Thomas Coy (University of York, UK), Vincent Hughes (University of York, UK), Philip Harrison (University of York, UK), Amelia J. Gully (University of York, UK)
There is a growing trend in the field of forensic speech science towards integrating the vanguard of speech technology with traditional linguistic methods, in pursuit of evidential methods that are both scalable (i.e. automatable) and accurate. To this end, this paper investigates DeepFormants, a DNN-based formant estimator which its creators, Dissen and Keshet [1], claim is an accurate tool ready for use by linguists. In the present paper, DeepFormants is integrated into semi-automatic speaker recognition systems using long-term formant distributions and compared against systems using traditional linear predictive coding (LPC). The readiness of the tool is assessed in terms of overall speaker recognition performance, measured using equal error rates (EER) and the log-likelihood-ratio cost function (Cllr). In high-quality conditions, DeepFormants outperforms the best-performing LPC systems. In channel-mismatch conditions, however, DeepFormants performs considerably worse overall, suggesting it does not adapt well to conditions on which it was not originally trained. The same is true of the LPC methods, raising questions over the validity of using formant analysis at all in such cases. A major benefit of DeepFormants over LPC is that the analyst does not need to specify analysis settings. We discuss the implications of this with regard to results for individual speakers.
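
For readers less familiar with the two evaluation metrics, the following is a minimal sketch, not taken from the paper, of how EER and Cllr might be computed from the likelihood ratios output by a speaker recognition system; the function and variable names are our own illustrative choices.

```python
import numpy as np

def cllr(same_speaker_lrs, diff_speaker_lrs):
    """Log-likelihood-ratio cost (Brummer & du Preez, 2006).

    Lower is better; a system that always outputs LR = 1 scores Cllr = 1.
    """
    ss = np.asarray(same_speaker_lrs, dtype=float)
    ds = np.asarray(diff_speaker_lrs, dtype=float)
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / ss)) +
                  np.mean(np.log2(1.0 + ds)))

def eer(same_speaker_lrs, diff_speaker_lrs):
    """Approximate equal error rate: the operating point at which the
    false-accept and false-reject rates are (nearly) equal."""
    ss = np.asarray(same_speaker_lrs, dtype=float)
    ds = np.asarray(diff_speaker_lrs, dtype=float)
    thresholds = np.sort(np.concatenate([ss, ds]))
    far = np.array([np.mean(ds >= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(ss < t) for t in thresholds])   # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy usage with made-up likelihood ratios:
ss = np.array([4.0, 12.0, 0.8, 6.5])    # same-speaker comparisons
ds = np.array([0.2, 1.5, 0.05, 0.6])    # different-speaker comparisons
print(f"Cllr = {cllr(ss, ds):.3f}, EER = {eer(ss, ds):.3f}")
```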