VoxForge

Crowdsourced corpus of transcribed read speech, collected to build free, open acoustic models for open-source ASR engines.

Recommendation

Good for free, copyleft-licensed ASR acoustic-model training and as a multilingual baseline, especially where GPL licensing is acceptable. Crowdsourced quality and per-language volume vary widely, and the strong copyleft can complicate combining it with differently-licensed data.

Getting the data

Obtain from the dataset homepage.

English is the largest of several language subsets (others include de, ru, es, fr, it, ...). Submitted audio is GPL-licensed; the exact GPL version is not authoritatively stated.
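Because per-language volume varies widely, it can help to measure how much audio each language actually contains before committing to a training setup. The following is a minimal sketch, not part of VoxForge or VoxKitchen: it assumes the downloaded audio has already been unpacked into a hypothetical local layout of `<root>/<lang>/.../*.wav`, and the function name `hours_per_language` is invented for illustration. It uses only the Python standard library.

```python
import wave
from collections import defaultdict
from pathlib import Path


def hours_per_language(root):
    """Sum WAV durations (in hours) under each top-level language directory.

    Assumes a hypothetical layout: <root>/<lang>/**/*.wav
    (this is an illustration, not the layout VoxForge itself guarantees).
    """
    totals = defaultdict(float)
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue
        for wav_path in lang_dir.rglob("*.wav"):
            try:
                with wave.open(str(wav_path), "rb") as wf:
                    rate = wf.getframerate()
                    if rate > 0:
                        # frames / (frames per second) = seconds; convert to hours
                        totals[lang_dir.name] += wf.getnframes() / rate / 3600.0
            except (wave.Error, EOFError):
                # Crowdsourced uploads may include unreadable files; skip them.
                continue
    return dict(totals)
```

Calling `hours_per_language("voxforge")` on such a directory returns a dictionary mapping each language directory name to its total hours of readable audio. Languages with no readable WAV files are omitted from the result.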

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml. Run it with vkit docker run.