InterSpeech 2021

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining
(3-minute introduction)

Xianwei Zhang (Tsinghua University, China), Liang He (Tsinghua University, China)
The spoken language understanding (SLU) plays an essential role in the field of human-computer interaction. Most of the current SLU systems are cascade systems of automatic speech recognition (ASR) and natural language understanding (NLU). Error propagation and scarcity of annotated speech data are two common difficulties for resource-poor languages. To solve them, we propose a simple but effective end-to-end cross-lingual spoken language understanding model based on XLSR-53, which is a pretrained model in 53 languages by the Facebook research team. The end-to-end approach avoids error propagation and the multilingual pretraining reduces data annotation requirements. Our proposed method achieves 99.71% on the Fluent Speech Commands (FSC) English database and 79.89% on the CATSLU-MAP Chinese database, in intent classification accuracy. To the best of our knowledge, the former is the reported best result on the FSC database.