TED-LIUM 3

452-hour English ASR corpus of TED talks with manual and automatic transcriptions; suitable for lecture/talk domain ASR research.

Recommendation

Good choice for spontaneous (but well-articulated) English ASR, in contrast to the read-speech style of LibriSpeech. The license is non-commercial. Use either the SPH or the WAV release; SPH audio needs conversion before use. Useful for domain-shift experiments alongside LibriSpeech.
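
If you start from the SPH release, the conversion step could look roughly like the sketch below. It is not part of VoxKitchen or the TED-LIUM distribution. It assumes the SoX command-line tool is installed and built with NIST SPHERE support, and the function names and directory layout are hypothetical, chosen for illustration only.

```python
import shutil
import subprocess
from pathlib import Path


def build_sox_command(sph_path: Path, wav_path: Path) -> list[str]:
    """Return the sox argv that converts one SPH file to WAV."""
    # sox infers the input and output formats from the file extensions.
    return ["sox", str(sph_path), str(wav_path)]


def convert_tree(sph_root: Path, wav_root: Path) -> int:
    """Convert every .sph file under sph_root, mirroring the layout under wav_root.

    Returns the number of files converted.
    """
    # Assumption: sox is on PATH; fail early with a clear message if not.
    if shutil.which("sox") is None:
        raise RuntimeError("sox not found on PATH")

    converted = 0
    for sph_path in sorted(sph_root.rglob("*.sph")):
        # Keep the relative path and swap the extension to .wav.
        wav_path = wav_root / sph_path.relative_to(sph_root).with_suffix(".wav")
        wav_path.parent.mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if sox exits non-zero.
        subprocess.run(build_sox_command(sph_path, wav_path), check=True)
        converted += 1
    return converted
```

Calling convert_tree(Path("sph_in"), Path("wav_out")) with your own directories would write one WAV per SPH file. Separating command construction from execution keeps the first function easy to check without SoX installed.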

Getting the data

Obtain from the dataset homepage.

Suggested processing

A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml. Run it with vkit docker run.