Shrutilipi

A 6400+ hour labelled ASR corpus across 12 Indian languages mined from All India Radio news bulletins by AI4Bharat, with document-level audio-text alignment.

Recommendation

Best-in-class scale for Indic ASR pretraining and low-resource fine-tuning: 6400+ labelled hours covering Bengali, Hindi, Tamil, Telugu, and other Indian languages underserved by Western corpora. The broadcast-news domain skews toward a formal, read-aloud register, so complement it with conversational data for spoken-dialogue use cases.

Getting the data

Obtain the corpus from the dataset homepage.

The corpus covers 12 languages, listed here by ISO 639-1 code: bn, gu, hi, kn, ml, mr, or, pa, sa, ta, te, ur. It is also mirrored on Hugging Face as ai4bharat/Shrutilipi.
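As a rough illustration of working with the language codes above, the following Python sketch maps each code to its English name and shows one possible way to stream a single language from the Hugging Face mirror. The helper names (resolve_language, stream_language) are hypothetical, and the per-language config naming on the mirror is an assumption that should be checked against the dataset card before use.

```python
# Minimal sketch: validate a Shrutilipi language code and, optionally,
# stream that language from the Hugging Face mirror.

# ISO 639-1 codes listed in this section, mapped to English language names.
SHRUTILIPI_LANGUAGES = {
    "bn": "Bengali",
    "gu": "Gujarati",
    "hi": "Hindi",
    "kn": "Kannada",
    "ml": "Malayalam",
    "mr": "Marathi",
    "or": "Odia",
    "pa": "Punjabi",
    "sa": "Sanskrit",
    "ta": "Tamil",
    "te": "Telugu",
    "ur": "Urdu",
}

HF_REPO_ID = "ai4bharat/Shrutilipi"  # mirror named in this section


def resolve_language(code: str) -> str:
    """Return the English name for a language code, or raise ValueError."""
    key = code.strip().lower()
    if key not in SHRUTILIPI_LANGUAGES:
        known = ", ".join(sorted(SHRUTILIPI_LANGUAGES))
        raise ValueError(f"Unknown language code {code!r}; expected one of: {known}")
    return SHRUTILIPI_LANGUAGES[key]


def stream_language(code: str, split: str = "train"):
    """Stream one language from the mirror without downloading everything.

    ASSUMPTION: the mirror exposes one config per language, named by the
    lowercase English language name, and has a "train" split. Neither is
    stated in this document; verify on the dataset card.
    """
    # Imported lazily so the lookup helpers work without `datasets` installed.
    from datasets import load_dataset

    config_name = resolve_language(code).lower()  # hypothetical config name
    return load_dataset(HF_REPO_ID, config_name, split=split, streaming=True)


if __name__ == "__main__":
    for code, name in SHRUTILIPI_LANGUAGES.items():
        print(f"{code}: {name}")
```

Streaming is used here because the full corpus is large; if the config naming assumption does not hold, only the stream_language helper needs adjusting, while the code lookup stays valid.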

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at examples/pipelines/fleurs-multilingual.yaml — run it with vkit docker run.