InterSpeech 2021

Using Transposed Convolution for Articulatory-to-Acoustic Conversion from Real-Time MRI Data
(3-minute introduction)

Ryo Tanji, Hidefumi Ohmura, Kouichi Katsurada (Tokyo University of Science, Japan)
We propose a deep neural network-based model for articulatory-to-acoustic conversion from real-time MRI (rtMRI) data. Although rtMRI can record the entire set of articulatory organs at high spatial resolution, which is an advantage for articulatory-to-acoustic conversion, its temporal sampling rate is relatively low. To address this, we incorporate a super-resolution technique along the temporal dimension using transposed convolution, which increases resolution by inverting the downsampling that a standard strided CNN performs. To evaluate performance on datasets with different temporal resolutions, we conducted experiments on two datasets: USC-TIMIT and a Japanese rtMRI dataset. Evaluations using mel-cepstral distortion and PESQ showed that transposed convolution is effective for generating accurate acoustic features. We also confirmed that increasing the magnification factor of the super-resolution improves the PESQ score.
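To illustrate the core idea, the following is a minimal sketch of temporal super-resolution with a 1-D transposed convolution in PyTorch. This is a hypothetical example, not the authors' architecture: the channel count, kernel size, and magnification factor are assumed for illustration. With stride r, a `ConvTranspose1d` inverts the r-fold downsampling of a strided convolution, so a low-frame-rate rtMRI feature sequence is mapped to a sequence r times longer in time.

```python
# Hypothetical sketch of transposed-convolution temporal super-resolution
# (not the paper's exact model; layer sizes are illustrative assumptions).
import torch
import torch.nn as nn


class TemporalUpsampler(nn.Module):
    """Upsample a feature sequence along the time axis by an integer factor.

    kernel_size = 2*factor, stride = factor, padding = factor // 2
    makes the output length exactly factor * input_length.
    """

    def __init__(self, in_ch: int, out_ch: int, factor: int):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(
            in_ch,
            out_ch,
            kernel_size=2 * factor,
            stride=factor,
            padding=factor // 2,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames) at the low rtMRI frame rate
        return self.deconv(x)


# Example: 4x magnification of a 50-frame sequence -> 200 frames,
# closer to the frame rate of the target acoustic features.
up = TemporalUpsampler(in_ch=64, out_ch=64, factor=4)
x = torch.randn(1, 64, 50)
y = up(x)
print(tuple(y.shape))  # (1, 64, 200)
```

The output length follows the standard transposed-convolution formula, (L_in - 1) * stride - 2 * padding + kernel_size = 49 * 4 - 4 + 8 = 200, i.e. exactly four times the input length, which is why the kernel/stride/padding combination above is convenient for integer magnification.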