MyST Children's Conversational Speech

~470 hours of English conversational speech from 1371 students in grades 3-5 interacting with a virtual science tutor across eight FOSS-curriculum science topics, produced by Boulder Learning.

Recommendation

Pick this corpus for child-speech ASR work that needs conversational, open-ended tutoring dialogue; it is one of the largest English children's-speech corpora available. Caveats: distribution is paid (via LDC), only ~45% of utterances are transcribed, and the grade range is 3-5 (not K-2).

Getting the data

Obtain from the dataset homepage.

Distributed via LDC under the MyST Children's Conversational Speech Agreement; commercial use requires contacting Boulder Learning Inc.

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml; run it with vkit docker run.
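
Because only ~45% of utterances are transcribed, supervised ASR training typically needs the untranscribed utterances set aside first. The following is a minimal sketch of that filtering step, independent of the VoxKitchen pipeline. The Utterance record and the keep_transcribed helper are hypothetical names introduced for illustration, and the file paths in the sample are made up; they do not reflect the corpus's actual layout or any VoxKitchen API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Utterance:
    # Hypothetical record type for illustration only; this is not the
    # corpus's on-disk schema or a VoxKitchen data structure.
    audio_path: str
    transcript: Optional[str]  # None or blank when no transcript exists


def keep_transcribed(utterances: Iterable[Utterance]) -> List[Utterance]:
    """Return only the utterances that carry a non-blank transcript."""
    return [u for u in utterances if u.transcript and u.transcript.strip()]


if __name__ == "__main__":
    # Made-up sample entries; real paths and texts will differ.
    sample = [
        Utterance("session_a/utt_001.wav", "the magnet sticks to the nail"),
        Utterance("session_a/utt_002.wav", None),
        Utterance("session_b/utt_001.wav", "   "),
    ]
    kept = keep_transcribed(sample)
    print(f"kept {len(kept)} of {len(sample)} utterances")
```

The untranscribed remainder does not have to be discarded; it could still serve unsupervised or self-supervised uses, which is why the sketch filters into a new list rather than deleting anything.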