InterSpeech 2021

Label Embedding for Chinese Grapheme-to-Phoneme Conversion
(3 minutes introduction)

Eunbi Choi (KAIST, Korea), Hwa-Yeon Kim (Naver, Korea), Jong-Hwan Kim (Naver, Korea), Jae-Min Kim (Naver, Korea)
Chinese grapheme-to-phoneme (G2P) conversion plays a significant role in text-to-speech systems by generating pronunciations corresponding to Chinese input characters. The main challenge in Chinese G2P conversion is polyphone disambiguation, which requires selecting the appropriate pronunciation among several candidates. In polyphone disambiguation, calculating probabilities for the entire pronunciations is unnecessary since each Chinese character has only a few (mostly two or three) candidate pronunciations. In this study, we introduce a label embedding approach that matches the character embedding with the closest label embedding among the possible candidates. Specifically, negative sampling and triplet loss were applied to maximize the difference between the correct embedding and the other candidate embeddings. Experimental results show that the label embedding approach improved the polyphone disambiguation accuracy by 4.50% and 1.74% on two datasets compared to the one-hot label classification approach. Moreover, the bidirectional long short-term memory model with the label embedding approach outperformed the previous most advanced model, BERT, demonstrating outstanding performance in polyphone disambiguation. Lastly, we discuss the effect of contextual information in character embeddings on the G2P conversion task.