InterSpeech 2021

The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation
(Oral presentation)

Vikram C. Mathad (Arizona State University, USA), Tristan J. Mahr (UW–Madison, USA), Nancy Scherer (Arizona State University, USA), Kathy Chapman (University of Utah, USA), Katherine C. Hustad (UW–Madison, USA), Julie Liss (Arizona State University, USA), Visar Berisha (Arizona State University, USA)
Automatic evaluation of phone-level pronunciation scores typically involves two stages: (1) automatic phonetic segmentation via text-constrained phoneme alignment and (2) quantification of the acoustic deviation of each phoneme relative to a database of correctly pronounced speech. The second stage clearly depends on the first: if a phoneme is misaligned, the measured acoustic deviation will also be affected. In this paper, we analyzed the impact of alignment error on a measure of goodness of pronunciation. We computed (1) automatic pronunciation scores from force-aligned samples, (2) the forced-alignment error rate, and (3) acoustic deviation from manually aligned samples. We used a bivariate linear regression model to characterize the contributions of forced-alignment errors and acoustic deviation to the automatic pronunciation scores. This was done across two databases of child speech: children with cleft lip/palate and typically developing children aged 3–6 years. The analysis shows that, for speech from typically developing children, most of the variation in the automatic pronunciation scores is explained by acoustic deviation, with forced-alignment errors playing a relatively minor role. In contrast, forced-alignment errors have a small but significant downstream impact on pronunciation assessment for children with cleft lip/palate.
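
The bivariate regression analysis described above can be illustrated with a minimal sketch: automatic pronunciation scores are regressed on two per-phoneme predictors, forced-alignment error and acoustic deviation, and the fitted coefficients indicate each predictor's contribution. This is not the authors' implementation; the variable names, the synthetic data, and the use of statsmodels are assumptions for illustration only.

```python
# Minimal sketch (assumed, not the paper's code) of the bivariate linear
# regression relating automatic pronunciation scores to forced-alignment
# error and acoustic deviation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_phonemes = 200

# Hypothetical per-phoneme predictors:
#   alignment_error - forced-alignment boundary error (e.g., in seconds)
#   acoustic_dev    - acoustic deviation computed from manually aligned segments
alignment_error = rng.exponential(scale=0.02, size=n_phonemes)
acoustic_dev = rng.normal(loc=0.0, scale=1.0, size=n_phonemes)

# Hypothetical automatic pronunciation scores computed from force-aligned samples
pron_score = (0.2 * alignment_error
              + 0.8 * acoustic_dev
              + rng.normal(scale=0.1, size=n_phonemes))

# Bivariate linear regression: how much of the variance in the automatic
# scores is explained by each predictor?
X = sm.add_constant(np.column_stack([alignment_error, acoustic_dev]))
model = sm.OLS(pron_score, X).fit()
print(model.summary())          # coefficients and p-values for each predictor
print("R^2:", model.rsquared)   # total variance explained
```

Under this setup, a large, significant coefficient on acoustic deviation with a comparatively small coefficient on alignment error would mirror the pattern the paper reports for typically developing children.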