Dataset Catalog
VoxKitchen only curates and links publicly available dataset information; it does not host or redistribute data. You are responsible for complying with each dataset's license and for obtaining the data yourself. The recommendations below are guidance to help you choose a dataset.
| Dataset | Task | Languages | Hours | License | Access |
|---|---|---|---|---|---|
| AISHELL-1 | asr | zh | 170 | CC BY-NC-ND 4.0 | recipe |
| AISHELL-2 | asr | zh | 1000 | see source terms | manual |
| AISHELL-3 | tts | zh | 85 | CC BY-NC-ND 4.0 | recipe |
| AISHELL-4 | asr, speaker | zh | 120 | CC BY-SA 4.0 | manual |
| AMI Meeting Corpus | asr, speaker | en | 100 | CC BY 4.0 | manual |
| AVSpeech | asr, speaker | multi | 4700 | see source terms | manual |
| CN-Celeb | speaker | zh | 1200 | see source terms | recipe |
| Common Voice | asr, multilingual | multi | — | CC0 1.0 | recipe |
| CREMA-D | emotion, speaker | en | — | ODbL 1.0 (database) + DbCL 1.0 (contents) | manual |
| CSS10 | tts | de, el, es, fi, fr, hu, ja, nl, ru, zh | 99 | see source terms | manual |
| DAPS (Device and Produced Speech) | tts | en | 4.5 | CC BY-NC 4.0 | manual |
| DiPCo (Dinner Party Corpus) | asr, speaker | en | — | CDLA-Permissive-1.0 | manual |
| Earnings-21 | asr | en | 39 | CC BY-SA 4.0 | manual |
| Earnings-22 | asr | en | 119 | CC BY-SA 4.0 | manual |
| Emilia | tts, multilingual | multi | — | see source terms | manual |
| Emotional Speech Database (ESD) | tts, emotion | en, zh | 29 | see source terms | manual |
| Expresso | tts, emotion | en | 40 | CC BY-NC 4.0 | manual |
| FLEURS | asr, multilingual | multi | — | CC BY 4.0 | recipe |
| GigaSpeech | asr | en | 10000 | see source terms | manual |
| GigaSpeech 2 | asr, multilingual | th, id, vi | 30000 | see source terms | manual |
| Golos | asr | ru | 1240 | see source terms | manual |
| Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) | tts | en | 292 | CC BY 4.0 | recipe |
| IEMOCAP | emotion, speaker | en | 12 | see source terms | manual |
| JSUT (Japanese speech corpus of Saruwatari-lab, U-Tokyo) | tts | ja | 10 | see source terms | manual |
| KeSpeech | asr, multilingual | zh | 1542 | see source terms | manual |
| KsponSpeech | asr | ko | 969 | see source terms | manual |
| Libri-Light | asr | en | 60000 | public domain (LibriVox) | manual |
| LibriSpeech | asr | en | 960 | CC BY 4.0 | recipe |
| LibriTTS | tts | en | 585 | CC BY 4.0 | recipe |
| LibriTTS-R | tts | en | 585 | CC BY 4.0 | recipe |
| LJSpeech | tts | en | 24 | Public Domain | recipe |
| MagicData-RAMC (Rich Annotated Mandarin Conversational) | asr, speaker | zh | 180 | see source terms | manual |
| MELD (Multimodal EmotionLines Dataset) | emotion, speaker | en | 13 | GPL-3.0 | manual |
| MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition) | asr, multilingual | ar | 1200 | see source terms | manual |
| Multilingual LibriSpeech | asr, multilingual | multi | 50000 | CC BY 4.0 | manual |
| MSP-IMPROV | emotion, speaker | en | — | see source terms | manual |
| MSP-Podcast | emotion, speaker | en | — | see source terms | manual |
| MUSAN | augmentation | en | 109 | CC BY 4.0 | recipe |
| MyST Children's Conversational Speech | asr | en | 470 | see source terms | manual |
| IMDA National Speech Corpus (NSC) | asr | en | 3000 | see source terms | manual |
| Opencpop | tts | zh | 5.2 | CC BY-NC-ND 4.0 | manual |
| People's Speech | asr | en | 30000 | CC BY-SA 4.0 | manual |
| RAVDESS | emotion, speaker | en | — | CC BY-NC-SA 4.0 | manual |
| ReazonSpeech | asr | ja | 35000 | CDLA-Sharing-1.0 | manual |
| SEAME (Mandarin-English Code-Switching Speech Corpus) | asr, multilingual | multi | 192 | see source terms | manual |
| Shrutilipi | asr, multilingual | multi | 6400 | CC BY 4.0 | manual |
| SLUE (Spoken Language Understanding Evaluation) | asr | en | 27.3 | see source terms | manual |
| SPGISpeech | asr | en | 5000 | see source terms | manual |
| Switchboard-1 Release 2 | asr, speaker | en | 260 | see source terms | manual |
| TED-LIUM 3 | asr | en | 452 | CC BY-NC-ND 3.0 | manual |
| THCHS-30 (Tsinghua Chinese 30-hour Database) | asr | zh | 30 | Apache-2.0 | recipe |
| Thorsten-Voice (German Neutral TTS) | tts | de | 23 | CC0 1.0 | recipe |
| TIMIT Acoustic-Phonetic Continuous Speech Corpus | asr | en | 5 | see source terms | manual |
| VCTK | tts, speaker | en | 44 | CC BY 4.0 | manual |
| VoxCeleb1 | speaker | multi | 352 | CC BY-SA 4.0 | manual |
| VoxCeleb2 | speaker | multi | 2442 | see source terms | manual |
| VoxForge | asr, multilingual | multi | — | GNU GPL | manual |
| VoxLingua107 | multilingual | multi | 6628 | CC BY 4.0 | manual |
| VoxPopuli | asr, multilingual | multi | — | CC0 | manual |
| WenetSpeech | asr | zh | 10000 | see source terms | manual |
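When shortlisting datasets, it can help to filter the table by task and license mechanically. The following is a minimal sketch, not part of VoxKitchen: `parse_catalog`, `select`, and `ALLOWED_LICENSES` are hypothetical names introduced for illustration, and the allow-list is only an example assumption about which license strings a project might accept. The rows are copied from the table above.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# A few rows copied from the catalog table; paste the full table to search all of it.
CATALOG_EXCERPT = """
| Dataset | Task | Languages | Hours | License | Access |
|---|---|---|---|---|---|
| AISHELL-1 | asr | zh | 170 | CC BY-NC-ND 4.0 | recipe |
| GigaSpeech | asr | en | 10000 | see source terms | manual |
| LibriSpeech | asr | en | 960 | CC BY 4.0 | recipe |
| LibriTTS | tts | en | 585 | CC BY 4.0 | recipe |
| LJSpeech | tts | en | 24 | Public Domain | recipe |
| THCHS-30 (Tsinghua Chinese 30-hour Database) | asr | zh | 30 | Apache-2.0 | recipe |
| VCTK | tts, speaker | en | 44 | CC BY 4.0 | manual |
"""

# Assumption: an illustrative allow-list, not a legal judgement. Adjust to your project.
ALLOWED_LICENSES = {"CC BY 4.0", "CC0 1.0", "Public Domain", "Apache-2.0"}


@dataclass(frozen=True)
class Entry:
    name: str
    tasks: Tuple[str, ...]
    languages: Tuple[str, ...]
    hours: Optional[float]  # None when the catalog gives no numeric figure
    license: str
    access: str


def parse_hours(cell: str) -> Optional[float]:
    """Return the hours as a float, or None when the cell is not numeric."""
    try:
        return float(cell)
    except ValueError:
        return None


def split_list(cell: str) -> Tuple[str, ...]:
    """Split a comma-separated cell such as 'tts, speaker' into a tuple."""
    return tuple(part.strip() for part in cell.split(","))


def parse_catalog(markdown: str) -> List[Entry]:
    """Parse six-column table rows, skipping the header and separator rows."""
    entries = []
    for line in markdown.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        is_header = cells[0] == "Dataset"
        is_separator = set(cells[0]) <= set("-: ")
        if len(cells) != 6 or is_header or is_separator:
            continue
        name, tasks, langs, hours, license_, access = cells
        entries.append(
            Entry(name, split_list(tasks), split_list(langs),
                  parse_hours(hours), license_, access)
        )
    return entries


def select(entries: List[Entry], task: str, licenses: Set[str]) -> List[Entry]:
    """Keep entries tagged with `task` whose license string is in `licenses`."""
    return [e for e in entries if task in e.tasks and e.license in licenses]


if __name__ == "__main__":
    for entry in select(parse_catalog(CATALOG_EXCERPT), "asr", ALLOWED_LICENSES):
        print(entry.name, entry.hours, entry.license, entry.access)
```

Rows marked "see source terms" never match the allow-list, so they drop out of the result and still need a manual check against the source's own terms.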
Browse by task
asr
- AISHELL-1 — 170-hour open Mandarin speech corpus recorded in clean studio conditions; the standard Chinese ASR benchmark.
- AISHELL-2 — 1,000 hours of clean Mandarin read speech from 1,991 speakers, covering topics such as entertainment, finance, technology, sports, places of interest, and voice commands, recorded over iOS, Android, and microphone channels.
- AISHELL-4 — Real-recorded Mandarin conference-meeting corpus (8-channel circular mic array), 211 sessions with 4-8 speakers each, annotated for transcription and speaker activity.
- AMI Meeting Corpus — ~100 h of recorded English meetings with synchronized audio, video, and rich annotations including transcripts and speaker labels.
- AVSpeech — A large-scale audio-visual dataset of ~4700 hours of 3-10 second clips drawn from ~290k YouTube videos, each segment featuring a single visible speaker with clean speech, released for the "Looking to Listen at the Cocktail Party" speech-separation work.
- Common Voice — Mozilla's crowd-sourced multilingual ASR corpus covering 100+ languages; size, quality, and demographics vary widely by language.
- DiPCo (Dinner Party Corpus) — English far-field conversational corpus of 10 dinner-party sessions (4 participants each, 15-45 minutes per session), recorded with a close-talk microphone per participant plus five 7-mic far-field array devices; designed for noise-robust distant ASR and diarization.
- Earnings-21 — 39 hours of 44 English-language earnings calls from 2020 across nine financial sectors, professionally transcribed by Rev.com for benchmarking ASR on named-entity-dense speech.
- Earnings-22 — 119 h benchmark of real-world English corporate earnings calls featuring diverse global accents across many countries.
- FLEURS — Few-shot Learning Evaluation of Universal Representations of Speech: a standardised ASR/LID evaluation set covering 102 languages, built on the FLoRes-101 text benchmark.
- GigaSpeech — 10,000-hour multi-domain English ASR corpus spanning audiobooks, podcasts, and YouTube.
- GigaSpeech 2 — Large-scale multi-domain ASR for low-resource Southeast Asian languages (Thai, Indonesian, Vietnamese), built by automated YouTube crawling and transcription (~30k h raw, ~22k h refined).
- Golos — ~1,240 h of manually annotated open Russian speech split between crowd-sourced (~1,106 h) and farfield/smart-device (~134 h) recordings.
- KeSpeech — 1,542 h from 27,237 speakers across 34 cities, covering standard Mandarin and its 8 subdialects with transcription, speaker, and subdialect labels.
- KsponSpeech — ~969 h of Korean spontaneous open-domain dialogue from ~2,000 native speakers, with dual orthographic + pronunciation transcription.
- Libri-Light — ~60k h of unlabelled English read speech from LibriVox audiobooks, with small labelled subsets (10h, 1h, 10min) for limited-supervision ASR.
- LibriSpeech — Read English audiobooks; the standard English ASR benchmark.
- MagicData-RAMC (Rich Annotated Mandarin Conversational) — 180 hours of Mandarin two-party conversational telephone-style speech from 663 speakers across Chinese accent regions, with speaker-turn and topic annotations spanning daily-life to technology topics.
- MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition) — 1200 hours of lightly supervised Arabic broadcast speech from 19 Al Jazeera Arabic TV programmes (2005-2015) — conversations, interviews, reports — with multi-dialect coverage.
- Multilingual LibriSpeech — 50,000-hour multilingual audiobook ASR corpus derived from LibriVox recordings covering 8 languages (English, German, Dutch, French, Spanish, Italian, Portuguese, Polish).
- MyST Children's Conversational Speech — ~470 hours of English conversational speech from 1371 students in grades 3-5 interacting with a virtual science tutor across eight FOSS-curriculum science topics, produced by Boulder Learning.
- IMDA National Speech Corpus (NSC) — Large-scale Singapore-English speech corpus from IMDA — ~2000 hours of orthographically transcribed read speech plus ~1000 hours of conversational speech, designed for ASR research on Singapore-accented English.
- People's Speech — 30,000-hour English ASR corpus assembled from diverse internet sources including radio broadcasts, court hearings, and conferences.
- ReazonSpeech — ~35,000 h open Japanese speech corpus collected from terrestrial TV broadcast streams with aligned Japanese transcriptions.
- SEAME (Mandarin-English Code-Switching Speech Corpus) — ~192 hours of spontaneous Mandarin-English code-switching conversations and interviews from 156 Singaporean and Malaysian speakers on everyday topics.
- Shrutilipi — A 6400+ hour labelled ASR corpus across 12 Indian languages mined from All India Radio news bulletins by AI4Bharat, with document-level audio-text alignment.
- SLUE (Spoken Language Understanding Evaluation) — English SLU benchmark on natural (not read) speech — Phase-1 adds ASR, named-entity recognition, and sentiment annotations over subsets of VoxPopuli and VoxCeleb; Phase-2 adds dialog act classification, QA, summarization, and named-entity localization.
- SPGISpeech — 5,000 h of professionally transcribed English company earnings-call audio, fully formatted with punctuation and capitalization.
- Switchboard-1 Release 2 — ~2,400 two-sided spontaneous English telephone conversations among 543 US speakers (~260 h), separated into two channels.
- TED-LIUM 3 — 452-hour English ASR corpus of TED talks with manual and automatic transcriptions; suitable for lecture/talk domain ASR research.
- THCHS-30 (Tsinghua Chinese 30-hour Database) — A 30-hour, 16 kHz Mandarin read-speech corpus from CSLT at Tsinghua University, recorded by 50 speakers in a quiet office, with word-, syllable-, and phone-level transcriptions.
- TIMIT Acoustic-Phonetic Continuous Speech Corpus — 630 American English speakers across 8 dialect regions, each reading 10 phonetically rich sentences, with time-aligned phonetic and word transcriptions.
- VoxForge — Crowdsourced corpus of transcribed read speech, contributed by volunteers to build free acoustic models for open-source ASR engines.
- VoxPopuli — Multilingual corpus from 2009-2020 European Parliament recordings: a large unlabelled set across 23 languages plus transcribed speech and aligned interpretations.
- WenetSpeech — 10,000-hour large-scale Mandarin ASR corpus collected from YouTube and podcasts with automatic labelling.
augmentation
- MUSAN — 109-hour corpus of music, speech, and environmental noise designed for data augmentation in speech and speaker recognition experiments.
emotion
- CREMA-D — 7,442 acted audio-visual emotional clips from 91 demographically diverse actors speaking 12 sentences in 6 emotions at 4 intensity levels.
- Emotional Speech Database (ESD) — >29 h of parallel emotional speech from 20 speakers (10 English, 10 Mandarin), each reading 350 parallel utterances across 5 emotions.
- Expresso — High-quality multi-speaker English expressive speech at 48 kHz (11 h read + 30 h improvised) across many spontaneous expressive styles, for expressive speech resynthesis.
- IEMOCAP — ~12 h of acted audio-visual dyadic interactions from 10 actors (scripted and improvised), with categorical and dimensional (valence/activation/dominance) emotion labels.
- MELD (Multimodal EmotionLines Dataset) — Multimodal (audio, video, text) emotion recognition corpus of ~13k utterances from ~1.4k multi-party dialogues sampled from the Friends TV series, labelled with seven emotions and three-way sentiment.
- MSP-IMPROV — Acted dyadic emotional speech corpus from UT Dallas with 12 actors across six dyad sessions producing 8438 speaking turns (652 target sentences) labelled for happiness, sadness, anger, and neutral.
- MSP-Podcast — Large-scale naturalistic emotional speech mined from Creative Commons podcasts, multi-rater annotated with categorical emotions and valence/arousal/dominance attributes.
- RAVDESS — Acted emotional speech and song from 24 professional actors across 8 emotions at two intensity levels (1,440 speech audio files).
multilingual
- Common Voice — Mozilla's crowd-sourced multilingual ASR corpus covering 100+ languages; size, quality, and demographics vary widely by language.
- Emilia — Large-scale multilingual in-the-wild speech dataset designed for expressive and diverse TTS training, covering 6 languages.
- FLEURS — Few-shot Learning Evaluation of Universal Representations of Speech: a standardised ASR/LID evaluation set covering 102 languages, built on the FLoRes-101 text benchmark.
- GigaSpeech 2 — Large-scale multi-domain ASR for low-resource Southeast Asian languages (Thai, Indonesian, Vietnamese), built by automated YouTube crawling and transcription (~30k h raw, ~22k h refined).
- KeSpeech — 1,542 h from 27,237 speakers across 34 cities, covering standard Mandarin and its 8 subdialects with transcription, speaker, and subdialect labels.
- MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition) — 1200 hours of lightly supervised Arabic broadcast speech from 19 Al Jazeera Arabic TV programmes (2005-2015) — conversations, interviews, reports — with multi-dialect coverage.
- Multilingual LibriSpeech — 50,000-hour multilingual audiobook ASR corpus derived from LibriVox recordings covering 8 languages (English, German, Dutch, French, Spanish, Italian, Portuguese, Polish).
- SEAME (Mandarin-English Code-Switching Speech Corpus) — ~192 hours of spontaneous Mandarin-English code-switching conversations and interviews from 156 Singaporean and Malaysian speakers on everyday topics.
- Shrutilipi — A 6400+ hour labelled ASR corpus across 12 Indian languages mined from All India Radio news bulletins by AI4Bharat, with document-level audio-text alignment.
- VoxForge — Crowdsourced corpus of transcribed read speech, contributed by volunteers to build free acoustic models for open-source ASR engines.
- VoxLingua107 — ~6,628 h across 107 languages of short segments automatically extracted from YouTube and labelled by video metadata, for spoken language ID.
- VoxPopuli — Multilingual corpus from 2009-2020 European Parliament recordings: a large unlabelled set across 23 languages plus transcribed speech and aligned interpretations.
speaker
- AISHELL-4 — Real-recorded Mandarin conference-meeting corpus (8-channel circular mic array), 211 sessions with 4-8 speakers each, annotated for transcription and speaker activity.
- AMI Meeting Corpus — ~100 h of recorded English meetings with synchronized audio, video, and rich annotations including transcripts and speaker labels.
- AVSpeech — A large-scale audio-visual dataset of ~4700 hours of 3-10 second clips drawn from ~290k YouTube videos, each segment featuring a single visible speaker with clean speech, released for the "Looking to Listen at the Cocktail Party" speech-separation work.
- CN-Celeb — 1,200-hour multi-genre Mandarin speaker recognition corpus spanning 11 real-world scenarios collected from Chinese celebrities.
- CREMA-D — 7,442 acted audio-visual emotional clips from 91 demographically diverse actors speaking 12 sentences in 6 emotions at 4 intensity levels.
- DiPCo (Dinner Party Corpus) — English far-field conversational corpus of 10 dinner-party sessions (4 participants each, 15-45 minutes per session), recorded with a close-talk microphone per participant plus five 7-mic far-field array devices; designed for noise-robust distant ASR and diarization.
- IEMOCAP — ~12 h of acted audio-visual dyadic interactions from 10 actors (scripted and improvised), with categorical and dimensional (valence/activation/dominance) emotion labels.
- MagicData-RAMC (Rich Annotated Mandarin Conversational) — 180 hours of Mandarin two-party conversational telephone-style speech from 663 speakers across Chinese accent regions, with speaker-turn and topic annotations spanning daily-life to technology topics.
- MELD (Multimodal EmotionLines Dataset) — Multimodal (audio, video, text) emotion recognition corpus of ~13k utterances from ~1.4k multi-party dialogues sampled from the Friends TV series, labelled with seven emotions and three-way sentiment.
- MSP-IMPROV — Acted dyadic emotional speech corpus from UT Dallas with 12 actors across six dyad sessions producing 8438 speaking turns (652 target sentences) labelled for happiness, sadness, anger, and neutral.
- MSP-Podcast — Large-scale naturalistic emotional speech mined from Creative Commons podcasts, multi-rater annotated with categorical emotions and valence/arousal/dominance attributes.
- RAVDESS — Acted emotional speech and song from 24 professional actors across 8 emotions at two intensity levels (1,440 speech audio files).
- Switchboard-1 Release 2 — ~2,400 two-sided spontaneous English telephone conversations among 543 US speakers (~260 h), separated into two channels.
- VCTK — 44-hour English multi-speaker corpus with 110 speakers covering a wide range of UK and US accents; widely used for multi-speaker TTS and speaker adaptation research.
- VoxCeleb1 — Speaker identification/verification corpus of 153,516 utterances from 1,251 celebrities extracted from YouTube interview videos.
- VoxCeleb2 — 2,442-hour large-scale speaker recognition corpus with 6,112 celebrities collected from YouTube across many languages.
tts
- AISHELL-3 — 85-hour multi-speaker Mandarin TTS corpus with 218 speakers in clean recording conditions; the standard Chinese multi-speaker TTS baseline.
- CSS10 — Single-speaker speech datasets for 10 languages built from aligned public-domain LibriVox clips, intended for TTS.
- DAPS (Device and Produced Speech) — Professional studio-quality speech with time-aligned recordings of the same speech captured on consumer devices (tablet, smartphone) in real-world environments; 20 speakers.
- Emilia — Large-scale multilingual in-the-wild speech dataset designed for expressive and diverse TTS training, covering 6 languages.
- Emotional Speech Database (ESD) — >29 h of parallel emotional speech from 20 speakers (10 English, 10 Mandarin), each reading 350 parallel utterances across 5 emotions.
- Expresso — High-quality multi-speaker English expressive speech at 48 kHz (11 h read + 30 h improvised) across many spontaneous expressive styles, for expressive speech resynthesis.
- Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) — ~291.6 h of high-quality English multi-speaker TTS data from 10 LibriVox speakers (at least 17 h each), 44.1 kHz, with Project Gutenberg text.
- JSUT (Japanese speech corpus of Saruwatari-lab, U-Tokyo) — A ~10-hour single-speaker Japanese read-speech corpus designed for end-to-end TTS, covering the main pronunciations of daily-use Japanese characters.
- LibriTTS — High-fidelity (24 kHz) read English audiobooks derived from LibriSpeech, with normalised transcriptions; the standard baseline for English TTS.
- LibriTTS-R — A sound-quality-restored version of LibriTTS — 585 hours of 24 kHz English read speech from 2456 speakers, identical samples/texts to LibriTTS but enhanced via Google's Miipher speech restoration model.
- LJSpeech — Single-speaker English TTS corpus (24 h, 13,100 clips) drawn from LibriVox readings; widely used as a single-speaker TTS baseline.
- Opencpop — High-quality Mandarin singing-voice synthesis corpus of 100 popular Chinese pop songs (3756 utterances) sung by a single female professional vocalist, 44.1 kHz, with phoneme/note boundary and pitch annotations.
- Thorsten-Voice (German Neutral TTS) — A free single-speaker German read-speech corpus (~22,668 phrases, 22.05 kHz mono) recorded by Thorsten Müller for open TTS training, with neutral and emotional variants released over multiple years.
- VCTK — 44-hour English multi-speaker corpus with 110 speakers covering a wide range of UK and US accents; widely used for multi-speaker TTS and speaker adaptation research.