VoxLingua107

About 6,628 hours of speech in 107 languages, split into short segments that were automatically extracted from YouTube videos and labelled using video metadata. Intended for spoken language identification.

Task: multilingual
Languages: multi
Hours: 6628
Domain: youtube
License: CC BY 4.0
Homepage: https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/
Paper: https://arxiv.org/abs/2011.12998

Recommendation

The go-to dataset for training and evaluating spoken language identification across many languages. Best for language-ID/embedding models rather than ASR or TTS, since labels are automatic (~98% accurate) and per-language hours vary widely. Use the human-validated dev set (33 languages) for evaluation.
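Because per-language hours vary widely, overall accuracy on the dev set can be dominated by the better-represented languages. A minimal sketch of per-language and macro-averaged accuracy follows. The function names and the plain lists of language codes are assumptions made for illustration, not part of any official VoxLingua107 tooling.

```python
from collections import defaultdict


def per_language_accuracy(references, predictions):
    """Return {language_code: accuracy} for two parallel lists of language codes."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, hyp in zip(references, predictions):
        total[ref] += 1
        if ref == hyp:
            correct[ref] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def macro_accuracy(references, predictions):
    """Average the per-language accuracies so every language counts equally."""
    per_lang = per_language_accuracy(references, predictions)
    if not per_lang:
        raise ValueError("no segments to score")
    return sum(per_lang.values()) / len(per_lang)


# Hypothetical toy labels: three Estonian segments, one English segment.
refs = ["et", "et", "et", "en"]
hyps = ["et", "et", "et", "fr"]
print(per_language_accuracy(refs, hyps))
print(macro_accuracy(refs, hyps))
```

In this toy case three of four segments are correct overall, yet the macro average is lower because the single English segment is misclassified. Reporting both numbers makes that kind of imbalance visible.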

Getting the data

Download the data from the dataset homepage listed above.

The dataset is distributed under CC BY 4.0, but copyright in the audio remains with the original video owners.
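Once the data has been downloaded and extracted, it can help to count segments per language before training, since the languages are unevenly represented. The sketch below assumes a layout of one directory per language code containing WAV files (for example `root/et/xyz.wav`). That layout and the function name are assumptions, so check them against the actual archives.

```python
from collections import Counter
from pathlib import Path


def count_segments_per_language(root):
    """Count WAV files in each immediate subdirectory of root.

    Assumes (hypothetically) one subdirectory per language code.
    """
    counts = Counter()
    for wav in Path(root).glob("*/*.wav"):
        counts[wav.parent.name] += 1
    return counts


if __name__ == "__main__":
    # "voxlingua107" is a placeholder path for the extracted data.
    for lang, n in sorted(count_segments_per_language("voxlingua107").items()):
        print(f"{lang}\t{n}")
```

The counts give only a rough view of imbalance, because segment durations differ. Summing durations per language would be more precise.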