InterSpeech 2021

Hierarchical Phone Recognition with Compositional Phonetics
(3-minute introduction)

Xinjian Li (Carnegie Mellon University, USA), Juncheng Li (Carnegie Mellon University, USA), Florian Metze (Carnegie Mellon University, USA), Alan W. Black (Carnegie Mellon University, USA)
There is growing interest in building phone recognition systems for low-resource languages, as the majority of languages do not have a writing system. Phone recognition systems proposed so far typically derive their phone inventory from the training languages, so the derived inventory can only cover a limited number of the phones existing in the world; such systems fail to recognize unseen phones in low-resource or zero-resource languages. In this work, we tackle this problem with a hierarchical model, in which we explicitly model three different entities in a hierarchical manner: phonemes, phones, and phonological articulatory attributes. In particular, we decompose phones into articulatory attributes and compute phone embeddings from attribute embeddings. The model first predicts a distribution over phones using their embeddings; the language-independent phones are then aggregated into language-dependent phonemes and optimized with the CTC loss. This compositional approach enables us to recognize phones even when they do not appear in the training set. We evaluate our model on 47 unseen languages and find that the proposed model outperforms baselines by 13.1% PER.
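
To make the hierarchy concrete, the following is a minimal PyTorch-style sketch (not the authors' code; the matrices, shapes, and the sum-based pooling rule are illustrative assumptions) of how phone embeddings can be composed from attribute embeddings, how frames are scored against those phone embeddings, and how the resulting phone posteriors can be aggregated into language-dependent phoneme posteriors before applying the CTC loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositionalPhoneLayer(nn.Module):
    """Sketch: phone embeddings built from attribute embeddings, frames scored
    against phones, phone posteriors pooled into language-dependent phonemes."""

    def __init__(self, embed_dim, phone_attribute_matrix, allophone_matrix):
        super().__init__()
        num_attributes = phone_attribute_matrix.shape[1]
        # One learnable embedding per phonological articulatory attribute.
        self.attribute_embedding = nn.Embedding(num_attributes, embed_dim)
        # phone_attribute_matrix[p, a] = 1 if phone p carries attribute a.  (P, A)
        self.register_buffer("phone_attr", phone_attribute_matrix)
        # allophone_matrix[m, p] = 1 if phone p realizes phoneme m.          (M, P)
        self.register_buffer("allophone", allophone_matrix)

    def forward(self, frame_encodings):
        # frame_encodings: (T, B, D) outputs of an acoustic encoder.
        # Compose each phone embedding as the sum of its attribute embeddings.
        phone_emb = self.phone_attr @ self.attribute_embedding.weight       # (P, D)
        # Language-independent distribution over phones for every frame.
        phone_probs = F.softmax(frame_encodings @ phone_emb.T, dim=-1)      # (T, B, P)
        # Aggregate phone probabilities into language-dependent phonemes
        # (a simple sum here; the paper's exact pooling may differ).
        phoneme_probs = phone_probs @ self.allophone.T                      # (T, B, M)
        phoneme_probs = phoneme_probs / phoneme_probs.sum(-1, keepdim=True).clamp_min(1e-8)
        # Log-probabilities in the layout expected by nn.CTCLoss.
        return phoneme_probs.clamp_min(1e-8).log()


# Toy usage with random matrices; index 0 of the phoneme set plays the CTC blank.
P, A, M, D, T, B, S = 6, 10, 4, 16, 50, 2, 12
layer = CompositionalPhoneLayer(
    D,
    phone_attribute_matrix=torch.randint(0, 2, (P, A)).float(),
    allophone_matrix=torch.randint(0, 2, (M, P)).float(),
)
log_probs = layer(torch.randn(T, B, D))                                     # (T, B, M)
loss = nn.CTCLoss(blank=0)(
    log_probs, torch.randint(1, M, (B, S)),
    input_lengths=torch.full((B,), T), target_lengths=torch.full((B,), S),
)
```

Because the phone layer is defined entirely by attribute composition and the allophone mapping, adding a phone unseen in training only requires a new row in the phone-attribute matrix, with no new learned parameters.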