GigaSpeech 2
Large-scale multi-domain ASR for low-resource Southeast Asian languages (Thai, Indonesian, Vietnamese), built by automated YouTube crawling and transcription (~30k h raw, ~22k h refined).
- Task: asr, multilingual
- Languages: th, id, vi
- Hours: 30000
- Domain: youtube
- License: see source terms
- Homepage: https://github.com/SpeechColab/GigaSpeech2
- Paper: https://arxiv.org/abs/2406.11546
Recommendation
Pick this for low-resource SE Asian ASR where labeled data is scarce; the refined splits give usable labels plus professional dev/test sets. Audio is gated and restricted to non-commercial research/education, and labels are machine-generated/auto-refined, so quality varies by split.
Getting the data
Obtain from the dataset homepage.
HF tags Apache-2.0 but access is gated with non-commercial research/education terms; SpeechColab does not own the audio copyright.
Suggested processing
A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml — run it with vkit docker run.