IMDA National Speech Corpus (NSC)

Large-scale Singapore-English speech corpus from IMDA — ~2000 hours of orthographically transcribed read speech plus ~1000 hours of conversational speech, designed for ASR research on Singapore-accented English.

Task: asr
Languages: en
Hours: 3000
Domain: read + conversational accented
License: see source terms
Homepage: https://www.imda.gov.sg/how-we-can-help/national-speech-corpus
Paper: https://www.isca-archive.org/interspeech_2019/koh19_interspeech.html

Recommendation

Top pick for Singapore-English / Southeast-Asian-accented ASR training and adaptation, and one of the largest openly available accented-English corpora. Choose it when you need locally relevant vocabulary, code-mixing patterns, or non-US/UK English coverage.

Getting the data

Obtain from the dataset homepage.

The corpus is distributed under IMDA's licence, often referenced as the Singapore Open Data Licence v1.0. Access is request-based via nsc@imda.gov.sg. The release is multi-part (Parts 1-6) and has substantial storage requirements.
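
Because the multi-part release is large, it may be worth checking free disk space before requesting or unpacking any part. The following is a minimal sketch using only the Python standard library. The helper name has_room, the 10% headroom, and the 1 TiB figure are illustrative assumptions, not values published by IMDA; substitute the sizes IMDA quotes for the parts you request.

```python
import shutil


def has_room(path: str, required_bytes: int, headroom: float = 0.1) -> bool:
    """Return True if the filesystem holding `path` has enough free space.

    `headroom` adds a safety margin on top of `required_bytes`
    (0.1 means 10% extra) to leave room for extraction and temp files.
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * (1 + headroom)


if __name__ == "__main__":
    # Assumed placeholder size (1 TiB), NOT an official NSC figure.
    required = 1 * 1024**4
    if has_room(".", required):
        print("Enough free space for the assumed download size.")
    else:
        print("Not enough free space; free up disk or pick another target.")
```

Point the path argument at the volume where the corpus will be stored, since free space is reported per filesystem.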

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml. Run it with vkit docker run.