VoxLingua107

~6,628 h of short speech segments in 107 languages, automatically extracted from YouTube videos and labelled from video metadata, for spoken language identification.

Recommendation

The go-to dataset for training and evaluating spoken language identification across many languages. Best for language-ID/embedding models rather than ASR or TTS, since labels are automatic (~98% accurate) and per-language hours vary widely. Use the human-validated dev set (33 languages) for evaluation.
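Because per-language hours vary widely, a single overall accuracy figure can be dominated by the best-represented languages. The following is a minimal sketch, not part of the dataset's own tooling, of how per-language and macro-averaged accuracy might be computed on an evaluation set. The function name `per_language_accuracy` and the toy language codes are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def per_language_accuracy(true_labels, predicted_labels):
    """Return ({language: accuracy}, macro-averaged accuracy).

    Hypothetical helper for illustration; labels are plain strings.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Accuracy is computed per true language, so each language counts
    # equally in the macro average regardless of how many clips it has.
    accuracy = {lang: correct[lang] / total[lang] for lang in total}
    macro = sum(accuracy.values()) / len(accuracy) if accuracy else 0.0
    return accuracy, macro


# Toy labels, not real dataset results.
truth = ["et", "et", "en", "en", "en", "fi"]
pred = ["et", "fi", "en", "en", "en", "fi"]
acc, macro = per_language_accuracy(truth, pred)
```

Reporting the macro average alongside the per-language breakdown makes weak languages visible even when they contribute few evaluation clips.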

Getting the data

Obtain from the dataset homepage.

Distributed under CC BY 4.0, but copyright remains with the original video owners.