Interspeech 2021

Automatic Speech Recognition of Disordered Speech: Personalized models outperforming human listeners on short phrases
(Oral presentation)

Jordan R. Green (MGH Institute of Health Professions, USA), Robert L. MacDonald (Google, USA), Pan-Pan Jiang (Google, USA), Julie Cattiau (Google, USA), Rus Heywood (Google, USA), Richard Cave (MND Association, UK), Katie Seaver (MGH Institute of Health Professions, USA), Marilyn A. Ladewig (Cerebral Palsy Associations of New York State, USA), Jimmy Tobin (Google, USA), Michael P. Brenner (Google, USA), Philip C. Nelson (Google, USA), Katrin Tomanek (Google, USA)
This study evaluated the accuracy of personalized automatic speech recognition (ASR) models in recognizing disordered speech, using an open vocabulary, from a large cohort of individuals with a wide range of underlying etiologies. The performance of these models was benchmarked against that of expert human transcribers and two speaker-independent ASR models trained on typical speech. A total of 432 individuals with self-reported disordered speech recorded at least 300 short phrases using a web-based application. Word error rates (WERs) were estimated for the three ASR models and for the human transcribers. Metadata were collected to evaluate the potential impact of participants' atypical speech characteristics and of technical factors on recognition accuracy. Personalized models outperformed human transcribers, with median and maximum recognition accuracy gains of 9% and 80%, respectively. Personalized models achieved high accuracy (median WER: 4.6%), substantially better than that of the speaker-independent models (median WER: 31%), with the largest improvements observed for the most severely affected speakers. Low signal-to-noise ratio and fewer training utterances were associated with poor word recognition, even for speakers with mild speech impairments. Our results demonstrate the efficacy of personalized ASR models in recognizing disordered speech across a wide range of impairment types and severities, using an open vocabulary.