|Changhan Wang (Facebook, USA), Anne Wu (Facebook, USA), Juan Pino (Facebook, USA), Alexei Baevski (Facebook, USA), Michael Auli (Facebook, USA), Alexis Conneau (Facebook, USA)|
In this paper, we improve speech translation (ST) by effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training using the large Libri-Light speech audio corpus, as well as language modeling with CommonCrawl data. Our experiments improve over the previous state of the art by 2.8 BLEU on average across all four CoVoST 2 language pairs considered, via a simple recipe that combines wav2vec 2.0 pretraining, a single iteration of self-training, and decoding with a language model. Unlike existing work, our approach does not leverage any supervision other than ST data. Code and models are publicly released.