Common Voice

Mozilla's crowd-sourced multilingual ASR corpus covering 100+ languages; size, quality, and demographics vary widely by language.

Task: asr, multilingual
Languages: multi
Domain: crowdsourced read speech
License: CC0 1.0
Homepage: https://commonvoice.mozilla.org/en/datasets

Recommendation

Best choice when you need a permissively-licensed ASR corpus for a low-resource language — likely the only freely available option for many languages. English and a handful of major languages have hundreds of hours; smaller languages may have only a few hours. Download a specific version/language snapshot for reproducibility.

Getting the data

Downloadable via VoxKitchen (commonvoice, source: HuggingFace, size: —):

vkit docker download --tag slim commonvoice --root ./data/commonvoice

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml — run it with vkit docker run.