InterSpeech 2021

Dynamic Multi-scale Convolution for Dialect Identification
(Oral presentation)

Tianlong Kong (Tsinghua University, China), Shouyi Yin (Tsinghua University, China), Dawei Zhang (Kwai, China), Wang Geng (Kwai, China), Xin Wang (Kwai, China), Dandan Song (Tsinghua University, China), Jinwen Huang (Kwai, China), Huiyu Shi (Tsinghua University, China), Xiaorui Wang (Kwai, China)
Time Delay Neural Network (TDNN)-based methods are widely used in dialect identification. However, previous TDNN-based work neglects the subtle variations that occur across different feature scales. To address this issue, we propose a new architecture, named dynamic multi-scale convolution, which consists of dynamic kernel convolution, local multi-scale learning, and global multi-scale pooling. Dynamic kernel convolution adaptively captures features between short-term and long-term context. Local multi-scale learning, which represents multi-scale features at a granular level, increases the range of receptive fields of the convolution operation. In addition, global multi-scale pooling aggregates features from different bottleneck layers in order to collect information from multiple aspects. The proposed architecture significantly outperforms the state-of-the-art system on the AP20-OLR-dialect-task of the Oriental Language Recognition (OLR) Challenge 2020, with a best average cost performance (Cavg) of 0.067 and a best equal error rate (EER) of 6.52%. Compared with the previously known best results, our method achieves relative improvements of 9% in Cavg and 45% in EER, respectively. Furthermore, the proposed model has 91% fewer parameters than the best known model.
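To make the three components concrete, the sketch below is a minimal PyTorch illustration of one plausible realization: a selective-kernel-style fusion for dynamic kernel convolution, a Res2Net-style channel split for local multi-scale learning, and mean/std statistics pooling over several bottleneck layers for global multi-scale pooling. All class names, kernel sizes, and dimensions are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicKernelConv1d(nn.Module):
    """Selective-kernel-style fusion (assumed form): a short- and a long-range
    convolution branch, combined by attention weights computed from globally
    pooled features, so the effective kernel adapts to the input."""

    def __init__(self, channels, short_kernel=3, long_kernel=7):
        super().__init__()
        self.short = nn.Conv1d(channels, channels, short_kernel,
                               padding=short_kernel // 2)
        self.long = nn.Conv1d(channels, channels, long_kernel,
                              padding=long_kernel // 2)
        self.attend = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, 2))       # one logit per branch

    def forward(self, x):                      # x: (batch, channels, time)
        branches = torch.stack([self.short(x), self.long(x)], dim=1)
        weights = self.attend(x.mean(dim=2)).softmax(dim=1)  # (batch, 2)
        return (branches * weights[:, :, None, None]).sum(dim=1)


class LocalMultiScaleConv1d(nn.Module):
    """Res2Net-style granular multi-scale block (assumed form): channels are
    split into groups, and small convolutions are chained hierarchically so
    later groups see an increasingly large receptive field."""

    def __init__(self, channels, scale=4, kernel=3):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        self.convs = nn.ModuleList(
            [nn.Conv1d(width, width, kernel, padding=kernel // 2)
             for _ in range(scale - 1)])

    def forward(self, x):                      # x: (batch, channels, time)
        chunks = torch.chunk(x, self.scale, dim=1)
        out, prev = [chunks[0]], None          # first group passes through
        for i, conv in enumerate(self.convs):
            inp = chunks[i + 1] if prev is None else chunks[i + 1] + prev
            prev = F.relu(conv(inp))           # hierarchical residual chaining
            out.append(prev)
        return torch.cat(out, dim=1)


def global_multiscale_pooling(layer_outputs):
    """Aggregate mean/std statistics from the outputs of several bottleneck
    layers (each of shape (batch, channels, time)) into one utterance-level
    embedding that mixes information from multiple depths."""
    stats = []
    for h in layer_outputs:
        stats.extend([h.mean(dim=2), h.std(dim=2)])
    return torch.cat(stats, dim=1)


# Example usage (illustrative shapes):
# x = torch.randn(8, 64, 200)               # (batch, channels, frames)
# y = DynamicKernelConv1d(64)(x)            # adaptive short/long-context conv
# z = LocalMultiScaleConv1d(64)(y)          # granular multi-scale features
# emb = global_multiscale_pooling([y, z])   # (8, 256) utterance embedding
```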