|Vasileios Papadourakis (Amazon, USA), Markus Müller (Amazon, USA), Jing Liu (Amazon, USA), Athanasios Mouchtaris (Amazon, USA), Maurizio Omologo (Amazon, USA)|
End-to-end automatic speech recognition systems map a sequence of acoustic features to text. In modern systems, text is encoded into grapheme subwords generated by methods designed for text processing tasks, which therefore do not model or exploit the statistics of the acoustic features. Here, we present a novel method for generating grapheme subwords that are derived from phoneme sequences and therefore capture phonetic statistics. The phonetically induced subwords can be used for training and inference in any system that benefits from subwords, regardless of architecture and without the need for a pronunciation lexicon. We compare our method to other commonly used methods, which are based on text statistics or on text-phoneme correspondence, and present experiments on CTC and RNN-T architectures, evaluating subword sets of different sizes. We find that our phonetically induced subwords can improve the performance of RNN-T models, with relative improvements of up to 15.21% compared to other subword methods.
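The abstract does not spell out the subword-induction algorithm, but a common way to build a data-driven subword inventory is byte-pair encoding (BPE): repeatedly merge the most frequent adjacent token pair. The sketch below is a minimal, hypothetical illustration of the general idea of running that pair-merging loop over phoneme sequences (e.g., ARPAbet symbols) instead of character sequences, so the learned units reflect phonetic statistics; it is not the paper's actual method, and all function names and the `_` joining convention are assumptions for illustration.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq`
    with a single merged token (phonemes joined by '_')."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + "_" + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_phoneme_bpe(sequences, num_merges):
    """BPE-style merge learning over phoneme sequences.

    `sequences` is a list of phoneme-token lists; returns the learned
    merge operations and the tokenized corpus after applying them.
    """
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges, seqs

# Toy phoneme corpus (ARPAbet-like symbols, illustrative only).
corpus = [
    ["HH", "AH", "L", "OW"],
    ["HH", "AH", "L", "P"],
    ["HH", "AH"],
]
merges, tokenized = learn_phoneme_bpe(corpus, 2)
```

On this toy corpus the first merge is `("HH", "AH")` (frequency 3), and the second extends it to `("HH_AH", "L")`, showing how frequent phoneme spans coalesce into multi-phoneme units. A full system along these lines would still need a mapping from the phoneme-level units back to grapheme subwords, which this sketch omits.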