MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition)
1200 hours of lightly supervised Arabic broadcast speech from 19 Al Jazeera Arabic TV programmes (2005-2015) — conversations, interviews, reports — with multi-dialect coverage.
- Task: asr, multilingual
- Languages: ar
- Hours: 1200
- Domain: Arabic TV broadcast (Al Jazeera)
- License: see source terms
- Homepage: http://www.mgb-challenge.org/MGB-2.html
- Paper: https://arxiv.org/abs/1609.05625
Recommendation
The strongest publicly known Arabic broadcast ASR resource — pick for Arabic ASR or multi-dialect acoustic modeling at scale. Transcriptions are lightly supervised (not gold) and access is gated through the MGB organizers (QCRI); dialect labels are not exhaustive.
Getting the data
Obtain from the dataset homepage.
Includes ~110M-word LM text from aljazeera.net.
Suggested processing
A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml — run it with vkit docker run.