Interspeech 2021

Automatic extraction of speech rhythm descriptors for speech intelligibility assessment in the context of head and neck cancers
(Oral presentation)

Robin Vaysse (IRIT (UMR 5505), France), Jérôme Farinas (IRIT (UMR 5505), France), Corine Astésano (URI Octogone-Lordat (EA 4156), France), Régine André-Obrecht (IRIT (UMR 5505), France)
The temporal dimension of speech acoustics is rarely taken into account in automatic models for speech intelligibility evaluation, although the rhythmic recurrence of phonemes, syllables and prosodic groups is reportedly a good predictor of speech intelligibility. The present study aims to identify the automatically extracted parameters that best account for the different levels of the speech signal’s rhythmic structure, and to evaluate their correlation with a perceptual intelligibility measure. The parameters are extracted from the Fourier transform of the amplitude modulation of the signal (Envelope Modulation Spectrum, EMS) [1, 2]. A Lasso linear model is first used to select the most relevant parameters, and a support vector regression (SVR) analysis is then run to identify the best combination of parameters. Our analyses of EMS, using data from the French corpus of cancer speech C2SI [3], show strong performance of the automatic prediction, with a correlation of 0.70 between our model and an intelligibility score assigned by speech pathologists. In particular, the parameter most highly correlated with speech intelligibility is the ratio between the energy in the low-frequency band (0.5–4 Hz, which represents slow rhythmic modulations indicative of prosodic groups) and the energy in the higher band (4–10 Hz, which represents fast rhythmic modulations such as phonemes).
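
As an illustration of the band-energy ratio described above, the following is a minimal sketch and not the authors' implementation. Only the two band limits (0.5–4 Hz and 4–10 Hz) come from the abstract. The envelope extraction (full-wave rectification followed by block averaging), the 100 Hz envelope rate, the Hann window and all function names are assumptions made for the example; the actual EMS procedure is the one described in [1, 2].

```python
import numpy as np


def envelope_modulation_spectrum(signal, sample_rate, target_env_rate=100):
    """Return (freqs, power) for the spectrum of the amplitude envelope.

    Assumption: the envelope is approximated by full-wave rectification
    followed by block averaging, which low-passes and decimates at once.
    """
    hop = int(sample_rate // target_env_rate)   # samples per envelope frame
    n_frames = len(signal) // hop
    rectified = np.abs(np.asarray(signal[: n_frames * hop], dtype=float))
    envelope = rectified.reshape(n_frames, hop).mean(axis=1)
    envelope -= envelope.mean()                 # remove the DC component
    env_rate = sample_rate / hop                # actual envelope sampling rate
    spectrum = np.fft.rfft(envelope * np.hanning(n_frames))
    freqs = np.fft.rfftfreq(n_frames, d=1.0 / env_rate)
    return freqs, np.abs(spectrum) ** 2


def band_energy(freqs, power, low_hz, high_hz):
    """Sum the spectral power in the half-open band [low_hz, high_hz)."""
    mask = (freqs >= low_hz) & (freqs < high_hz)
    return power[mask].sum()


def low_high_ratio(signal, sample_rate):
    """Energy in 0.5-4 Hz divided by energy in 4-10 Hz (bands from the abstract)."""
    freqs, power = envelope_modulation_spectrum(signal, sample_rate)
    low = band_energy(freqs, power, 0.5, 4.0)
    high = band_energy(freqs, power, 4.0, 10.0)
    return low / high
```

On a synthetic tone whose amplitude is modulated at 2 Hz, `low_high_ratio` should come out well above 1, and well below 1 for a 6 Hz modulation. In the study, a feature of this kind would be one of several EMS parameters passed to the Lasso selection step and the SVR model.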