SEAME (Mandarin-English Code-Switching Speech Corpus)
~192 hours of spontaneous Mandarin-English code-switching conversations and interviews from 156 Singaporean and Malaysian speakers on everyday topics.
- Task: asr, multilingual
- Languages: multi
- Hours: 192
- Domain: code-switching conversational
- License: see source terms
- Homepage: https://catalog.ldc.upenn.edu/LDC2015S04
Recommendation
The de facto benchmark for Mandarin-English code-switching ASR. Pick it when you need intra-sentence code-switching with realistic Southeast Asian accents. The corpus is modest in size by modern standards, and its accent distribution (Singapore/Malaysia) may not transfer to mainland-China Mandarin or US English.
Getting the data
Obtain the corpus from the LDC catalog page listed under Homepage.
It is a paid LDC distribution (LDC2015S04): 16 kHz FLAC audio and UTF-8 transcripts with per-token language labels.
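As a quick sanity check on the per-token language labels, the sketch below tags whitespace-delimited tokens by Unicode script and flags utterances that mix both languages. It is only a heuristic under stated assumptions: it does not read the LDC transcript format, the function names tag_token and is_code_switched are hypothetical, and the example utterance is made up rather than taken from the corpus.

```python
import re

# Assumption: Mandarin tokens are written in CJK Unified Ideographs and
# English tokens in ASCII letters; anything else is tagged "other".
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def tag_token(token: str) -> str:
    """Return 'zh', 'en', or 'other' for one whitespace-delimited token.

    A token containing both scripts is tagged 'zh' because the CJK
    check runs first.
    """
    if CJK.search(token):
        return "zh"
    if LATIN.search(token):
        return "en"
    return "other"


def is_code_switched(utterance: str) -> bool:
    """True when a single utterance contains both Mandarin and English tokens."""
    tags = {tag_token(tok) for tok in utterance.split()}
    return {"zh", "en"} <= tags


# Hypothetical utterance for illustration only.
example = "我 今天 要 去 meeting 然后 submit report"
print([(tok, tag_token(tok)) for tok in example.split()])
print(is_code_switched(example))
```

Comparing these script-based tags against the labels shipped with the transcripts can surface encoding or tokenization problems early, before any training run.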
Suggested processing
A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/asr-training-data.yaml. Run it with vkit docker run.