VoxLingua107
About 6,628 hours of short speech segments in 107 languages, automatically extracted from YouTube videos and labelled from video metadata, intended for spoken language identification.
- Task: multilingual
- Languages: multi
- Hours: 6628
- Domain: youtube
- License: CC BY 4.0
- Homepage: https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/
- Paper: https://arxiv.org/abs/2011.12998
Recommendation
The go-to dataset for training and evaluating spoken language identification across many languages. It suits language-ID and language-embedding models better than ASR or TTS, because labels are assigned automatically (about 98% accurate) and per-language hours vary widely. For evaluation, use the human-validated dev set, which covers 33 of the 107 languages.
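Because per-language hours vary widely, it can help to measure the imbalance before training. The following is a minimal sketch, not an official tool: it assumes the audio has been extracted into one folder per language code (`<root>/<lang>/*.wav`) as uncompressed PCM WAV files, and the function name `hours_per_language` is hypothetical. Adjust the glob pattern if your local layout differs.

```python
import wave
from collections import defaultdict
from pathlib import Path


def hours_per_language(root):
    """Sum WAV durations per language.

    Assumes a <root>/<lang>/*.wav layout (an assumption about the local
    copy, not something guaranteed by the dataset distribution).
    """
    totals = defaultdict(float)
    for wav_path in Path(root).glob("*/*.wav"):
        with wave.open(str(wav_path), "rb") as wav_file:
            # Duration in seconds = number of frames / sample rate.
            seconds = wav_file.getnframes() / wav_file.getframerate()
        # The parent folder name is treated as the language label.
        totals[wav_path.parent.name] += seconds / 3600.0
    return dict(totals)
```

The resulting dictionary maps each language folder name to its total hours, which can then be used to decide on per-language sampling weights or a cap on hours per language.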
Getting the data
Download the data from the dataset homepage listed above.
The dataset is distributed under CC BY 4.0, but copyright in the underlying audio remains with the original video owners.