Silent versus modal multi-speaker speech recognition from ultrasound and video (3 minutes introduction)

Manuel Sam Ribeiro (Amazon, Poland), Aciel Eshky (Rasa Technologies, UK), Korin Richmond (University of Edinburgh, UK), Steve Renals (University of Edinburgh, UK)

We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as feature-space maximum likelihood linear regression (fMLLR) and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines extracted from ultrasound tongue images. Overall, we observe that silent speech is longer in duration than modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although the differences in these two properties across speaking modes are statistically significant, they do not directly correlate with word error rates from speech recognition.
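The articulatory-space estimate mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that a convex hull of tongue splines is computed; everything else here is an assumption made for illustration: each spline is taken to be a list of 2-D (x, y) points, splines are pooled before the hull is computed, and the hull area is used as the size measure. The function names (`convex_hull`, `polygon_area`, `articulatory_space_area`) are hypothetical and do not come from the authors.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Convex hull via Andrew's monotone chain, counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop points that would create a clockwise or collinear turn.
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: Sequence[Point]) -> float:
    """Area of a simple polygon using the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def articulatory_space_area(splines: Sequence[Sequence[Point]]) -> float:
    """Assumed measure: hull area of all spline points pooled together."""
    pooled = [pt for spline in splines for pt in spline]
    return polygon_area(convex_hull(pooled))
```

Under these assumptions, a speaker's silent and modal utterances would each yield one pooled area, and the two values could then be compared per speaker. The coordinates below are toy values, not data from the study:

```python
toy_splines = [[(0, 0), (1, 0), (2, 0)], [(0, 2), (1, 1), (2, 2)]]
area = articulatory_space_area(toy_splines)
```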

Related Recordings

Remote smartphone-based speech collection: acceptance and barriers in individuals with major depressive disorder
(3 minutes introduction)

Judith Dineley, Grace Lavelle, Daniel Leightley, Faith Matcham, Sara Siddi, Maria Teresa Peñarrubia-María, Katie M. White, Alina Ivan, Carolin Oetzmann, Sara Simblett, Erin Dawe-Lane, Stuart Bruce, Daniel Stahl, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Amos A. Folarin, Josep Maria Haro, Til Wykes, Richard J.B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Björn W. Schuller, Nicholas Cummins, The RADAR-CNS Consortium

InterSpeech 2021

Remote smartphone-based speech collection: acceptance and barriers in individuals with major depressive disorder
(longer introduction)
