InterSpeech 2021

Modular Multi-Modal Attention Network for Alzheimer's Disease Detection Using Patient Audio and Language Data

Ning Wang (Stevens Institute of Technology, USA), Yupeng Cao (Stevens Institute of Technology, USA), Shuai Hao (Stevens Institute of Technology, USA), Zongru Shao (CASUS, Germany), K.P. Subbalakshmi (Stevens Institute of Technology, USA)
In this work, we propose a modular multi-modal architecture to automatically detect Alzheimer's disease using the dataset provided in the ADReSSo challenge. Both acoustic and text-based features are used in this architecture. Since the dataset provides only audio samples of controls and patients, we use the Google Cloud Speech-to-Text API to automatically transcribe the audio files and extract text-based features. Several kinds of audio features are extracted using standard packages. The proposed approach consists of four networks: a C-Attention-Acoustic network (acoustic features only), a C-Attention-FT network (linguistic features only), a C-Attention-Embedding network (language and acoustic embeddings), and a C-Attention-Unified network (which uses all of these features). The architecture combines attention networks with a convolutional neural network (the C-Attention network) to process these features. Experimental results show that the C-Attention-Unified network with linguistic features and X-Vector embeddings achieves the best performance, with an accuracy of 80.28% and an F1 score of 0.825 on the test set.
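To illustrate the general idea of pairing a convolutional front end with self-attention, as the C-Attention networks do, the sketch below shows a minimal, hypothetical block in PyTorch. The module name, layer sizes, pooling, and two-class head are illustrative assumptions for a sequence of per-frame features, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CAttentionBlock(nn.Module):
    """Illustrative CNN + self-attention block for binary (control vs. AD) classification."""
    def __init__(self, feat_dim, conv_channels=64, n_heads=4):
        super().__init__()
        # 1-D convolution over the feature sequence (time axis)
        self.conv = nn.Conv1d(feat_dim, conv_channels, kernel_size=3, padding=1)
        # multi-head self-attention over the convolved sequence
        self.attn = nn.MultiheadAttention(conv_channels, n_heads, batch_first=True)
        self.classifier = nn.Linear(conv_channels, 2)

    def forward(self, x):                                   # x: (batch, seq_len, feat_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)    # (batch, seq_len, conv_channels)
        h, _ = self.attn(h, h, h)                            # self-attention across time steps
        pooled = h.mean(dim=1)                               # average-pool over the sequence
        return self.classifier(pooled)                       # class logits

# Example usage with a dummy batch of 8 utterances, 100 frames, 40-dim acoustic features
logits = CAttentionBlock(feat_dim=40)(torch.randn(8, 100, 40))
```

In the paper's unified variant, the outputs of such per-modality branches (acoustic features, linguistic features, embeddings) are combined before the final decision; the sketch above corresponds only to a single-modality branch.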