KeSpeech
1,542 h from 27,237 speakers across 34 cities, covering standard Mandarin and its 8 subdialects with transcription, speaker, and subdialect labels.
- Task: asr, multilingual
- Languages: zh
- Hours: 1542
- Domain: dialect
- License: see source terms
- Homepage: https://github.com/KeSpeech/KeSpeech
- Paper: https://openreview.net/forum?id=b3Zoeq2sCLq
Recommendation
Excellent for Mandarin ASR, accent/subdialect identification, and speaker recognition, with parallel Mandarin-vs-subdialect recordings. Choose it when dialectal robustness or large speaker diversity matters. Download requires accepting a custom usage agreement.
Getting the data
Obtain from the dataset homepage.
NeurIPS 2021 Datasets & Benchmarks. Download requires accepting a usage agreement; covers 8 subdialects.
Suggested processing
A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml — run it with vkit docker run.