KsponSpeech

~969 h of Korean spontaneous open-domain dialogue from ~2,000 native speakers, with dual orthographic + pronunciation transcription.

Recommendation

KsponSpeech is the standard large-scale corpus for Korean spontaneous-speech ASR and the reference dataset for Korean ASR toolkits. It is distributed via the Korean government's AIHub portal under custom terms that require registration and approval, which can be a barrier for non-Korean users.

Getting the data

Obtain from the dataset homepage.

Access requires an AIHub account and agreement to the AIHub usage terms; the data is not released under a standard open license.

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml. Run it with vkit docker run.
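
Because the corpus carries dual orthographic + pronunciation transcription, ASR training text usually has to be collapsed to one of the two forms first. The sketch below is illustrative only and is not part of the VoxKitchen pipeline. It assumes the two forms appear inline as (orthographic)/(pronunciation) pairs; verify this against the transcripts you actually download. The function name select_transcription and its mode parameter are hypothetical.

```python
import re

# ASSUMPTION: dual transcription is written inline as
# "(orthographic)/(pronunciation)". Check the real transcript files first.
DUAL_PAIR = re.compile(r"\(([^()]*)\)\s*/\s*\(([^()]*)\)")


def select_transcription(text: str, mode: str = "orthographic") -> str:
    """Keep one side of every (orthographic)/(pronunciation) pair.

    Hypothetical helper for illustration; not a VoxKitchen API.
    """
    if mode not in ("orthographic", "pronunciation"):
        raise ValueError(f"unknown mode: {mode!r}")
    group = 1 if mode == "orthographic" else 2
    # Replace each pair with the chosen side only.
    resolved = DUAL_PAIR.sub(lambda m: m.group(group), text)
    # Collapse any whitespace runs left behind by the substitution.
    return re.sub(r"\s+", " ", resolved).strip()


if __name__ == "__main__":
    line = "가격은 (3,000원)/(삼천 원) 입니다"
    print(select_transcription(line, "orthographic"))
    print(select_transcription(line, "pronunciation"))
```

Lines without any pair pass through unchanged apart from whitespace normalization. Which side to keep is a modelling choice: the orthographic form preserves digits and symbols as written, while the pronunciation form matches what was actually spoken.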