GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
(3 minutes introduction)

Guoguo Chen (SpeechColab, China), Shuzhou Chai (SpeechColab, China), Guan-Bo Wang (SpeechColab, China), Jiayu Du (SpeechColab, China), Wei-Qiang Zhang (SpeechColab, China), Chao Weng (Tencent, China), Dan Su (Tencent, China), Daniel Povey (Xiaomi, China), Jan Trmal (Johns Hopkins University, USA), Junbo Zhang (Xiaomi, China), Mingjie Jin (Tencent, China), Sanjeev Khudanpur (Johns Hopkins University, USA), Shinji Watanabe (Johns Hopkins University, USA), Shuaijiang Zhao (KE, China), Wei Zou (KE, China), Xiangang Li (KE, China), Xuchen Yao (Seasalt AI, USA), Yongqing Wang (Xiaomi, China), Zhao You (Tencent, China), Zhiyong Yan (Xiaomi, China)

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles and a variety of topics, such as arts, science and sports. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
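The word-error-rate cap described in the abstract can be illustrated with a minimal Python sketch. This is not the GigaSpeech pipeline itself: it assumes each candidate segment comes with its original transcript and a decoded hypothesis from some recognizer, computes a plain word-level edit distance, and keeps the segment only if the resulting error rate is within the cap. The function names and the `"other"` subset label are hypothetical; only the 4% and 0% caps come from the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference only matches
        # an empty hypothesis.
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Caps stated in the abstract; "other" is a hypothetical label standing in
# for all of the smaller training subsets.
WER_CAP = {"XL": 0.04, "other": 0.0}


def keep_segment(reference: str, hypothesis: str, subset: str) -> bool:
    """Return True if the segment's error rate is within the subset's cap."""
    return word_error_rate(reference, hypothesis) <= WER_CAP[subset]
```

Under this sketch, a segment whose transcript and hypothesis agree word for word passes both caps, while a four-word segment with a single substitution has an error rate of 25% and is rejected for either subset.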

Related Recordings

The Multilingual TEDx Corpus for Speech Recognition and Translation
(3 minutes introduction)

Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, Matt Post

AusKidTalk: An Auditory-Visual Corpus of 3- to 12-year-old Australian Children’s Speech
(3 minutes introduction)

Beena Ahmed, Kirrie J. Ballard, Denis Burnham, Tharmakulasingam Sirojan, Hadi Mehmood, Dominique Estival, Elise Baker, Felicity Cox, Joanne Arciuli, Titia Benders, Katherine Demuth, Barbara Kelly, Chloé Diskin-Holdaway, Mostafa Shahin, Vidhyasaharan Sethu, Julien Epps, Chwee Beng Lee, Eliathamby Ambikairajah

InterSpeech 2021
