THCHS-30 (Tsinghua Chinese 30-hour Database)

A 30-hour Mandarin read-speech corpus from CSLT Tsinghua, 16 kHz, with word/syllable/phone-level transcriptions and 50 speakers in a quiet office.

Task: asr
Languages: zh
Hours: 30
Domain: read
License: Apache-2.0
Homepage: https://www.openslr.org/18/
Paper: https://arxiv.org/abs/1512.01882

Recommendation

Great lightweight baseline for Mandarin ASR experiments, recipe smoke-tests, and teaching/demo pipelines — small size, permissive Apache-2.0. Pair with AISHELL-1/2 or WenetSpeech for production scale.

Getting the data

Downloadable via VoxKitchen (thchs30, source: openslr, size: 23 MB - 6.0 GB):

vkit docker download --tag slim thchs30 --root ./data/thchs30

Subsets: main, resource, test-noise.

Distributed via OpenSLR mirrors; ~6.4 GB compressed.

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml — run it with vkit docker run.