THCHS-30 (Tsinghua Chinese 30-hour Database)

A 30-hour Mandarin read-speech corpus from CSLT at Tsinghua University, sampled at 16 kHz, with word-, syllable-, and phone-level transcriptions. It was recorded by 50 speakers in a quiet office.

Recommendation

A good lightweight baseline for Mandarin ASR experiments, recipe smoke-tests, and teaching/demo pipelines: the corpus is small and released under the permissive Apache-2.0 license. Pair it with AISHELL-1/2 or WenetSpeech for production scale.

Getting the data

Downloadable via VoxKitchen (thchs30, source: openslr, size: 23 MB to 6.0 GB depending on subset):

vkit docker download --tag slim thchs30 --root ./data/thchs30

Subsets: main, resource, test-noise.

Distributed via OpenSLR mirrors; ~6.4 GB compressed.
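
Once the archive is extracted, the word-, syllable-, and phone-level transcriptions can be read with a few lines of Python. The sketch below assumes each transcript file holds exactly three non-empty lines (word-segmented text, pinyin syllables with tone digits, then phone units), which is how the corpus is commonly described; verify this against your local copy. The names `Transcript` and `parse_trn` are hypothetical helpers for illustration and are not part of VoxKitchen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    words: List[str]      # line 1: word-segmented Chinese text
    syllables: List[str]  # line 2: pinyin syllables with tone digits
    phones: List[str]     # line 3: phone units


def parse_trn(text: str) -> Transcript:
    """Split an assumed three-line transcript into token lists.

    Assumption: the file content is three whitespace-tokenized lines
    in the order words / syllables / phones. Blank lines are ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-empty lines, got {len(lines)}")
    words, syllables, phones = (ln.split() for ln in lines)
    return Transcript(words, syllables, phones)
```

To use it on a real file, read the file as UTF-8 and pass its contents in, for example `parse_trn(open(path, encoding="utf-8").read())`. A `ValueError` here usually means the file is not in the assumed layout (for instance, it contains a path to another file instead of the transcript itself).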

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml; run it with vkit docker run.