Skip to content

GigaSpeech 2

Large-scale multi-domain ASR for low-resource Southeast Asian languages (Thai, Indonesian, Vietnamese), built by automated YouTube crawling and transcription (~30k h raw, ~22k h refined).

Recommendation

Pick this for low-resource SE Asian ASR where labeled data is scarce; the refined splits give usable labels plus professional dev/test sets. Audio is gated and restricted to non-commercial research/education, and labels are machine-generated/auto-refined, so quality varies by split.

Getting the data

Obtain from the dataset homepage.

HF tags Apache-2.0 but access is gated with non-commercial research/education terms; SpeechColab does not own the audio copyright.

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml — run it with vkit docker run.