Interspeech 2021

Using the outputs of different automatic speech recognition paradigms for acoustic- and BERT-based Alzheimer's Dementia detection through spontaneous speech

Yilin Pan (University of Sheffield, UK), Bahman Mirheidari (University of Sheffield, UK), Jennifer M. Harris (University of Manchester, UK), Jennifer C. Thompson (University of Manchester, UK), Matthew Jones (University of Manchester, UK), Julie S. Snowden (University of Manchester, UK), Daniel Blackburn (University of Sheffield, UK), Heidi Christensen (University of Sheffield, UK)
Exploring the acoustic and linguistic information embedded in spontaneous speech recordings has proven effective for automatic Alzheimer's dementia detection. Acoustic features can be extracted directly from the audio recordings; linguistic features, however, must in a fully automatic system be extracted from transcripts generated by an automatic speech recognition (ASR) system. We explore two state-of-the-art ASR paradigms, wav2vec 2.0 (for transcription and feature extraction) and time delay neural networks (TDNN), on the ADReSSo dataset, which contains recordings of people describing the Cookie Theft (CT) picture. As no manual transcripts are provided, we train an ASR system using our in-house CT data. We further investigate the use of confidence scores and multiple ASR hypotheses to guide and augment the input to the BERT-based classifier. In total, five models are proposed to explore how acoustic and linguistic information can be extracted from the audio recordings alone. The best acoustic-only and best linguistic-only models achieve test results of 74.65% and 84.51%, respectively, representing 15% and 9% relative improvements over the published baseline results.
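
As a rough illustration only (not the authors' implementation), the sketch below shows how a fully automatic pipeline of this kind could be assembled with the HuggingFace transformers and torchaudio libraries: a wav2vec 2.0 model produces both an automatic transcript (via greedy CTC decoding) and frame-level acoustic features, and a BERT sequence classifier takes the transcript as input. The checkpoints, the placeholder file name cookie_theft.wav, and the mean-pooling of features are illustrative assumptions; the confidence-score and multiple-hypothesis augmentation described above is not shown.

    import torch
    import torchaudio
    from transformers import (
        Wav2Vec2Processor, Wav2Vec2ForCTC,
        BertTokenizer, BertForSequenceClassification,
    )

    # Load a Cookie Theft recording and resample to the 16 kHz mono input
    # expected by wav2vec 2.0 (the file name is a placeholder).
    waveform, sr = torchaudio.load("cookie_theft.wav")
    waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

    # Acoustic/linguistic front-end: wav2vec 2.0 as the ASR system.
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = asr_model(inputs.input_values, output_hidden_states=True)

    # Greedy CTC decoding yields the automatic transcript ...
    transcript = processor.batch_decode(torch.argmax(out.logits, dim=-1))[0]

    # ... while the final transformer layer doubles as acoustic features
    # (mean-pooled over time here purely for illustration).
    acoustic_features = out.hidden_states[-1].mean(dim=1)  # shape: (1, hidden_dim)

    # Linguistic classifier: BERT over the ASR transcript. The two-way head is
    # untrained here and would need fine-tuning on AD / non-AD labels.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    enc = tokenizer(transcript, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = bert(**enc).logits

    print("ASR transcript:", transcript)
    print("Classifier logits:", logits)

In practice the BERT head would be fine-tuned on labelled AD/non-AD transcripts, and a TDNN-based ASR system (the second paradigm compared in the paper) would provide an alternative source of transcripts, confidence scores, and n-best hypotheses.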