AISHELL-1

170-hour open Mandarin speech corpus recorded in clean studio conditions; the standard Chinese ASR benchmark.

Recommendation

The go-to starting point for Mandarin ASR: clean studio recordings with full transcripts. The license is non-commercial, so check the terms before production use. The corpus is not representative of spontaneous or accented Mandarin; supplement it with WenetSpeech for broader coverage.

Getting the data

Downloadable via VoxKitchen (aishell, source: OpenSLR, size: 1 MB to 14.5 GB):

vkit docker download --tag slim aishell --root ./data/aishell

Subsets: data_aishell, resource_aishell.
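
Once the data is on disk, a common first step is to pair each recording with its transcript. The following is a minimal Python sketch, not part of VoxKitchen. It assumes the layout of the OpenSLR release after the per-speaker archives have been unpacked: audio under data_aishell/wav/<split>/<speaker>/<utterance>.wav and a single transcript file at data_aishell/transcript/aishell_transcript_v0.8.txt, with one utterance ID followed by space-separated words per line. Whether vkit produces exactly this layout is an assumption; adjust the paths if it differs. The function names load_transcripts and index_corpus are hypothetical.

```python
from pathlib import Path


def load_transcripts(transcript_path):
    """Map utterance ID -> transcript text, with the word-segmentation spaces removed."""
    transcripts = {}
    with open(transcript_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            utt_id, words = parts[0], parts[1:]
            transcripts[utt_id] = "".join(words)
    return transcripts


def index_corpus(root):
    """Yield (split, utt_id, wav_path, text) for every wav file that has a transcript.

    Assumed layout (not verified against vkit's output):
      <root>/data_aishell/transcript/aishell_transcript_v0.8.txt
      <root>/data_aishell/wav/<split>/<speaker>/<utt_id>.wav
    """
    data_dir = Path(root) / "data_aishell"
    transcripts = load_transcripts(
        data_dir / "transcript" / "aishell_transcript_v0.8.txt"
    )
    for wav in sorted((data_dir / "wav").glob("*/*/*.wav")):
        utt_id = wav.stem
        if utt_id in transcripts:
            split = wav.parts[-3]  # train, dev, or test
            yield split, utt_id, wav, transcripts[utt_id]
```

Calling list(index_corpus("./data/aishell")) would then give one tuple per transcribed utterance. Recordings without a transcript entry are skipped rather than raising an error, since not every wav file is guaranteed to have one.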

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml; run it with vkit docker run.