InterSpeech 2021

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization
(3-minute introduction)

Gonçal V. Garcés Díaz-Munío (Universitat Politècnica de València, Spain), Joan-Albert Silvestre-Cerdà (Universitat Politècnica de València, Spain), Javier Jorge (Universitat Politècnica de València, Spain), Adrià Giménez Pastor (Universitat Politècnica de València, Spain), Javier Iranzo-Sánchez (Universitat Politècnica de València, Spain), Pau Baquero-Arnal (Universitat Politècnica de València, Spain), Nahuel Roselló (Universitat Politècnica de València, Spain), Alejandro Pérez-González-de-Martos (Universitat Politècnica de València, Spain), Jorge Civera (Universitat Politècnica de València, Spain), Albert Sanchis (Universitat Politècnica de València, Spain), Alfons Juan (Universitat Politècnica de València, Spain)
We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1 300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence.