MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition)

1,200 hours of lightly supervised Arabic broadcast speech from 19 Al Jazeera Arabic TV programmes (2005-2015), covering conversations, interviews, and reports across multiple dialects.

Recommendation

The strongest publicly known resource for Arabic broadcast ASR. Choose it for Arabic ASR or for multi-dialect acoustic modeling at scale. Caveats: transcriptions are lightly supervised rather than gold-standard, access is gated through the MGB organizers (QCRI), and dialect labels are not exhaustive.
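Because the transcriptions are lightly supervised, a common precaution is to drop segments whose transcript matches the audio poorly before training. The following is a minimal sketch of that idea only. The Segment class, its field names, the match_error_rate score, and both thresholds are hypothetical illustrations, not the MGB-2 metadata schema or a recommended setting; check the dataset's own documentation for the fields it actually provides.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """One transcribed segment. Field names are illustrative, not the MGB-2 schema."""
    segment_id: str
    text: str
    duration_s: float
    # Hypothetical per-segment alignment score: 0.0 means the transcript
    # matched the audio perfectly, higher values mean a worse match.
    match_error_rate: float


def filter_segments(
    segments: Iterable[Segment],
    max_error_rate: float = 0.2,   # assumed threshold, tune on held-out data
    min_duration_s: float = 1.0,   # assumed threshold, drops very short clips
) -> List[Segment]:
    """Keep segments that are long enough and whose transcript plausibly matches."""
    return [
        s for s in segments
        if s.match_error_rate <= max_error_rate and s.duration_s >= min_duration_s
    ]


if __name__ == "__main__":
    demo = [
        Segment("seg-a", "placeholder text", 4.2, 0.05),  # clean, kept
        Segment("seg-b", "placeholder text", 3.0, 0.60),  # poor match, dropped
        Segment("seg-c", "placeholder text", 0.4, 0.00),  # too short, dropped
    ]
    kept = filter_segments(demo)
    print([s.segment_id for s in kept])
```

A stricter error-rate threshold trades training hours for cleaner labels, so the cut-off is best chosen by checking recognition accuracy on a development set.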

Getting the data

Obtain the data from the dataset homepage.

The release also includes roughly 110M words of language-model (LM) text from aljazeera.net.

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml. Run it with vkit docker run.