WenetSpeech
Large-scale Mandarin ASR corpus with about 10,000 hours of automatically labelled speech collected from YouTube and podcasts.
- Task: asr
- Languages: zh
- Hours: 10000
- Domain: in-the-wild (YouTube, podcasts, audiobooks)
- License: see source terms
- Homepage: https://github.com/wenet-e2e/WenetSpeech
- Paper: https://arxiv.org/abs/2110.03370
Recommendation
Go-to corpus for large-scale Mandarin ASR when AISHELL-1 is too clean or too small. The transcripts are generated automatically and contain label noise, so expect to filter them with the data-cleaning pipeline. Non-commercial restrictions apply; check the source terms before production use.
Getting the data
Obtain the corpus from the dataset homepage: register for access, then download with the WenetSpeech toolkit. The full corpus requires several TB of storage. Quality varies by subset; the "L" training set has the most automatic-label noise.
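Because label noise differs between subsets, one simple first pass is to keep only segments above a confidence threshold. The sketch below is a minimal illustration, not part of the WenetSpeech toolkit or the VoxKitchen pipeline. It assumes the metadata has been loaded into a dictionary with an "audios" list whose entries carry a "segments" list, and that each segment has "sid", "confidence", "subsets" and "text" fields; verify these names against the metadata file you actually downloaded. The function name select_segments and the 0.98 threshold are hypothetical choices for illustration.

```python
from typing import Any, Dict, Iterator


def select_segments(
    metadata: Dict[str, Any],
    subset: str = "L",
    min_confidence: float = 0.98,  # illustrative threshold, tune for your use case
) -> Iterator[Dict[str, Any]]:
    """Yield segments that belong to `subset` and meet `min_confidence`.

    Assumed layout (check against your copy of the metadata):
    {"audios": [{"aid": ..., "path": ..., "segments": [
        {"sid": ..., "confidence": ..., "subsets": [...], "text": ...}]}]}
    """
    for audio in metadata.get("audios", []):
        for segment in audio.get("segments", []):
            # Skip segments that are not part of the requested subset.
            if subset not in segment.get("subsets", []):
                continue
            # Treat a missing confidence as 0.0 so unlabeled scores are dropped.
            if segment.get("confidence", 0.0) < min_confidence:
                continue
            yield {
                "sid": segment.get("sid"),
                "path": audio.get("path"),
                "confidence": segment.get("confidence"),
                "text": segment.get("text"),
            }
```

To use it on the real corpus, load the metadata file with json.load and pass the resulting dictionary to select_segments. The full metadata file is large, so a streaming JSON parser may be preferable to loading it all at once.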
Suggested processing
A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/data-cleaning.yaml; run it with vkit docker run.