InterSpeech 2021

Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification
(Oral presentation)

Leying Zhang (SJTU, China), Zhengyang Chen (SJTU, China), Yanmin Qian (SJTU, China)
Voice and face are two important biometric characteristics that can be used for person identity verification. Previous works have demonstrated the strong complementarity between the audio and visual modalities in person verification tasks: a multi-modality system can achieve significant performance improvement over a single-modality system. However, due to real-world constraints, it is often hard to access both audio and visual data at the same time. In this paper, we investigate several strategies to distill the knowledge from a multi-modality system and transfer it to a single-modality system in a teacher-student manner. We apply knowledge distillation at three different levels: the label level, the embedding level, and the distribution level. All the experiments are based on the VoxCeleb dataset. The results show that the visual single-modality system achieves a 10% improvement in EER (equal error rate) on the VoxCeleb1 evaluation set using our proposed knowledge distillation method. In addition, the improvement on the audio system is only observed on part of the evaluation trials, and we provide a detailed analysis of this phenomenon.
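
To make the three distillation levels concrete, here is a minimal PyTorch sketch of what the corresponding loss terms could look like. This is an illustrative reconstruction under assumed names (`student_emb`, `teacher_logits`, temperature `T`, etc.), not the authors' actual implementation.

```python
# Sketch of label-, embedding-, and distribution-level distillation losses
# for training a single-modality student from a multi-modality teacher.
# All tensor names and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_losses(student_emb, student_logits,
                        teacher_emb, teacher_logits,
                        labels, T=4.0):
    # Label level: cross-entropy against the speaker labels (the teacher's
    # predicted labels could be substituted for the ground-truth labels).
    label_loss = F.cross_entropy(student_logits, labels)

    # Embedding level: pull the single-modality embedding toward the
    # multi-modality teacher embedding (mean squared error here).
    embedding_loss = F.mse_loss(student_emb, teacher_emb.detach())

    # Distribution level: KL divergence between temperature-softened
    # teacher and student posterior distributions, with the standard
    # T^2 scaling to keep gradient magnitudes comparable.
    distribution_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    return label_loss, embedding_loss, distribution_loss
```

In a typical teacher-student setup, these terms would be combined in a weighted sum, with the teacher's parameters frozen so that only the student is updated.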