Camryn Terblanche (University of Cape Town, South Africa), Philip Harrison (University of York, UK), Amelia J. Gully (University of York, UK)
Over the past few years, attention has focused on the automatic detection of spoofing in the context of automatic speaker verification (ASV) systems. However, little is known about how well humans perform at detecting spoofed speech, particularly under degraded conditions. Using the latest synthesis technologies from ASVspoof 2019, this paper explores human judgements of speech authenticity by considering three common channel degradations (a GSM network, a VoIP network, and background noise) in conjunction with varying synthesis quality. The results reveal that channel degradation reduces the size of the perceptual difference between genuine and spoofed speech, and that overall participants correctly identified human and spoofed speech only 56% of the time. In background noise and GSM transmission, lower-quality synthetic speech was judged as more human, and in VoIP transmission all speech, including genuine recordings, was judged as less human. Under all conditions, state-of-the-art synthetic speech was judged as human as, or more human than, genuine recorded speech. The paper also considers the listener factors which may contribute to an individual’s spoofing detection performance, and finds that a listener’s familiarity with the accents involved, their age, and the audio equipment used for playback all have an effect on their spoofing detection performance.