Tianxiang Chen, Avrosh Kumar, Parav Nagarsheth, Ganesh Sivaraman, Elie Khoury
Audio deepfakes, technically known as logical-access voice spoofing attacks, have become an increasing threat to voice interfaces due to recent breakthroughs in speech synthesis and voice conversion technologies. Effectively detecting these attacks is critical to many speech applications, including automatic speaker verification systems. As new speech synthesis and voice conversion techniques emerge rapidly, the generalization ability of spoofing countermeasures is becoming an increasingly critical challenge. This paper addresses this issue by using a large margin cosine loss (LMCL) function and a frequency masking layer to force the neural network to learn more robust feature embeddings. We evaluate the proposed system on the ASVspoof 2019 logical access (LA) dataset. Additionally, we evaluate it on a noisy version of the ASVspoof 2019 dataset, using publicly available noises to simulate more realistic scenarios. Finally, we evaluate the proposed system on a copy of the dataset that is logically replayed through the telephony channel to simulate a spoofing attack scenario in the call center. Our baseline system is based on a residual neural network and achieved the lowest equal error rate (EER), 4.04%, among all single-system submissions at the ASVspoof 2019 challenge. Furthermore, the improved system proposed in this paper achieves an EER of 1.26%, a reduction by a factor of about three relative to our previous state-of-the-art system.
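As a rough illustration of the two ingredients named above, the following NumPy sketch shows one common way to write a frequency masking step and a large margin cosine loss. It is not the implementation used in this paper. The function names, the scale `s = 30.0`, the margin `m = 0.35`, the maximum mask width, and the feature dimensions are all assumptions chosen for the example. The loss follows the usual LMCL formulation, in which embeddings and class weights are L2-normalised and the margin is subtracted from the target-class cosine before scaling.

```python
import numpy as np


def frequency_mask(spec, max_width, rng):
    """Zero a random band of consecutive frequency bins.

    spec: array of shape (n_freq, n_frames); max_width must not exceed n_freq.
    Returns a masked copy; the input is left untouched.
    """
    n_freq = spec.shape[0]
    width = int(rng.integers(0, max_width + 1))       # band width in [0, max_width]
    start = int(rng.integers(0, n_freq - width + 1))  # band must fit inside the spectrum
    masked = spec.copy()
    masked[start:start + width, :] = 0.0
    return masked


def lmcl_loss(embeddings, weights, labels, s=30.0, m=0.35):
    """Large margin cosine loss averaged over a batch.

    embeddings: (batch, dim), weights: (n_classes, dim), labels: (batch,) ints.
    s (scale) and m (margin) are illustrative values, not the paper's settings.
    """
    # L2-normalise both sides so the dot product is a cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = x @ w.T                                     # (batch, n_classes)

    # Subtract the margin from the target-class cosine only.
    rows = np.arange(len(labels))
    margin = np.zeros_like(cos)
    margin[rows, labels] = m
    logits = s * (cos - margin)

    # Numerically stable log-softmax followed by cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())


# Hypothetical usage: 60 frequency bins, 100 frames, two classes (bona fide vs. spoof).
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 100))
augmented = frequency_mask(features, max_width=12, rng=rng)

batch_embeddings = rng.normal(size=(8, 16))
class_weights = rng.normal(size=(2, 16))
batch_labels = rng.integers(0, 2, size=8)
loss = lmcl_loss(batch_embeddings, class_weights, batch_labels)
```

Because the margin lowers only the target-class logit, the loss with `m > 0` is always larger than with `m = 0` for the same inputs, which pushes the network toward embeddings that separate the classes by a wider angular gap.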