Dataset Catalog

VoxKitchen only curates and links to publicly available dataset information; it does not host or redistribute data. You are responsible for complying with each dataset's license and for obtaining the data yourself. The recommendations below are guidance to help you decide which datasets to use.

Dataset	Task	Languages	Hours	License	Access
AISHELL-1	asr	zh	170	CC BY-NC-ND 4.0	recipe
AISHELL-2	asr	zh	1000	see source terms	manual
AISHELL-3	tts	zh	85	CC BY-NC-ND 4.0	recipe
AISHELL-4	asr, speaker	zh	120	CC BY-SA 4.0	manual
AMI Meeting Corpus	asr, speaker	en	100	CC BY 4.0	manual
AVSpeech	asr, speaker	multi	4700	see source terms	manual
CN-Celeb	speaker	zh	1200	see source terms	recipe
Common Voice	asr, multilingual	multi	—	CC0 1.0	recipe
CREMA-D	emotion, speaker	en	—	ODbL 1.0 (database) + DbCL 1.0 (contents)	manual
CSS10	tts	de, el, es, fi, fr, hu, ja, nl, ru, zh	99	see source terms	manual
DAPS (Device and Produced Speech)	tts	en	4.5	CC BY-NC 4.0	manual
DiPCo (Dinner Party Corpus)	asr, speaker	en	—	CDLA-Permissive-1.0	manual
Earnings-21	asr	en	39	CC BY-SA 4.0	manual
Earnings-22	asr	en	119	CC BY-SA 4.0	manual
Emilia	tts, multilingual	multi	—	see source terms	manual
Emotional Speech Database (ESD)	tts, emotion	en, zh	29	see source terms	manual
Expresso	tts, emotion	en	40	CC BY-NC 4.0	manual
FLEURS	asr, multilingual	multi	—	CC BY 4.0	recipe
GigaSpeech	asr	en	10000	see source terms	manual
GigaSpeech 2	asr, multilingual	th, id, vi	30000	see source terms	manual
Golos	asr	ru	1240	see source terms	manual
Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS)	tts	en	292	CC BY 4.0	recipe
IEMOCAP	emotion, speaker	en	12	see source terms	manual
JSUT (Japanese speech corpus of Saruwatari-lab, U-Tokyo)	tts	ja	10	see source terms	manual
KeSpeech	asr, multilingual	zh	1542	see source terms	manual
KsponSpeech	asr	ko	969	see source terms	manual
Libri-Light	asr	en	60000	public domain (LibriVox)	manual
LibriSpeech	asr	en	960	CC BY 4.0	recipe
LibriTTS	tts	en	585	CC BY 4.0	recipe
LibriTTS-R	tts	en	585	CC BY 4.0	recipe
LJSpeech	tts	en	24	Public Domain	recipe
MagicData-RAMC (Rich Annotated Mandarin Conversational)	asr, speaker	zh	180	see source terms	manual
MELD (Multimodal EmotionLines Dataset)	emotion, speaker	en	13	GPL-3.0	manual
MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition)	asr, multilingual	ar	1200	see source terms	manual
Multilingual LibriSpeech	asr, multilingual	multi	50000	CC BY 4.0	manual
MSP-IMPROV	emotion, speaker	en	—	see source terms	manual
MSP-Podcast	emotion, speaker	en	—	see source terms	manual
MUSAN	augmentation	en	109	CC BY 4.0	recipe
MyST Children's Conversational Speech	asr	en	470	see source terms	manual
IMDA National Speech Corpus (NSC)	asr	en	3000	see source terms	manual
Opencpop	tts	zh	5.2	CC BY-NC-ND 4.0	manual
People's Speech	asr	en	30000	CC BY-SA 4.0	manual
RAVDESS	emotion, speaker	en	—	CC BY-NC-SA 4.0	manual
ReazonSpeech	asr	ja	35000	CDLA-Sharing-1.0	manual
SEAME (Mandarin-English Code-Switching Speech Corpus)	asr, multilingual	multi	192	see source terms	manual
Shrutilipi	asr, multilingual	multi	6400	CC BY 4.0	manual
SLUE (Spoken Language Understanding Evaluation)	asr	en	27.3	see source terms	manual
SPGISpeech	asr	en	5000	see source terms	manual
Switchboard-1 Release 2	asr, speaker	en	260	see source terms	manual
TED-LIUM 3	asr	en	452	CC BY-NC-ND 3.0	manual
THCHS-30 (Tsinghua Chinese 30-hour Database)	asr	zh	30	Apache-2.0	recipe
Thorsten-Voice (German Neutral TTS)	tts	de	23	CC0-1.0	recipe
TIMIT Acoustic-Phonetic Continuous Speech Corpus	asr	en	5	see source terms	manual
VCTK	tts, speaker	en	44	CC BY 4.0	manual
VoxCeleb1	speaker	multi	352	CC BY-SA 4.0	manual
VoxCeleb2	speaker	multi	2442	see source terms	manual
VoxForge	asr, multilingual	multi	—	GNU GPL	manual
VoxLingua107	multilingual	multi	6628	CC BY 4.0	manual
VoxPopuli	asr, multilingual	multi	—	CC0	manual
WenetSpeech	asr	zh	10000	see source terms	manual
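
The table above can be filtered programmatically once it is saved as tab-separated text. The following is a minimal sketch, not part of VoxKitchen: the names `load_catalog`, `select`, and `by_task` are hypothetical helpers introduced here for illustration, and the embedded rows are a small sample copied from the table. It assumes the column order shown in the header row and treats a non-numeric Hours cell as "unknown".

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the table above, tab-separated as shown there.
# "\u2014" is the dash placeholder used when no hour count is listed.
CATALOG_TSV = "\n".join([
    "Dataset\tTask\tLanguages\tHours\tLicense\tAccess",
    "AISHELL-1\tasr\tzh\t170\tCC BY-NC-ND 4.0\trecipe",
    "AISHELL-4\tasr, speaker\tzh\t120\tCC BY-SA 4.0\tmanual",
    "Common Voice\tasr, multilingual\tmulti\t\u2014\tCC0 1.0\trecipe",
    "LibriSpeech\tasr\ten\t960\tCC BY 4.0\trecipe",
    "LJSpeech\ttts\ten\t24\tPublic Domain\trecipe",
    "WenetSpeech\tasr\tzh\t10000\tsee source terms\tmanual",
])


def parse_hours(cell):
    """Return the hour count as a float, or None for a non-numeric cell."""
    try:
        return float(cell)
    except ValueError:
        return None


def load_catalog(text):
    """Parse the tab-separated table into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        rows.append({
            "name": row["Dataset"],
            # Task and Languages cells hold comma-separated tags.
            "tasks": [t.strip() for t in row["Task"].split(",")],
            "languages": [c.strip() for c in row["Languages"].split(",")],
            "hours": parse_hours(row["Hours"].strip()),
            "license": row["License"],
            "access": row["Access"],
        })
    return rows


def select(rows, task=None, access=None, min_hours=None):
    """Keep rows matching every filter that is not None."""
    kept = []
    for r in rows:
        if task is not None and task not in r["tasks"]:
            continue
        if access is not None and r["access"] != access:
            continue
        # Rows with an unknown hour count never satisfy a minimum.
        if min_hours is not None and (r["hours"] is None or r["hours"] < min_hours):
            continue
        kept.append(r)
    return kept


def by_task(rows):
    """Build a task -> dataset names index, like the task lists on this page."""
    index = defaultdict(list)
    for r in rows:
        for t in r["tasks"]:
            index[t].append(r["name"])
    return dict(index)


if __name__ == "__main__":
    catalog = load_catalog(CATALOG_TSV)
    for r in select(catalog, task="asr", access="recipe"):
        print(r["name"], r["license"])
```

The filter only narrows the list. The License column still has to be read against your intended use, and entries marked "see source terms" need a visit to the original source.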

Browse by task

asr

AISHELL-1 — 170-hour open Mandarin speech corpus recorded in clean studio conditions; the standard Chinese ASR benchmark.
AISHELL-2 — 1000 hours of clean Mandarin read speech from 1,991 speakers covering entertainment, finance, technology, sports, and place-of-interest commands, recorded over iOS/Android/microphone channels.
AISHELL-4 — Mandarin conference-meeting corpus recorded in real meeting settings with an 8-channel circular microphone array: 211 sessions with 4-8 speakers each, annotated for transcription and speaker activity.
AMI Meeting Corpus — ~100 h of recorded English meetings with synchronized audio, video, and rich annotations including transcripts and speaker labels.
AVSpeech — A large-scale audio-visual dataset of ~4700 hours of 3-10 second clips drawn from ~290k YouTube videos, each segment featuring a single visible speaker with clean speech, released for the "Looking to Listen at the Cocktail Party" speech-separation work.
Common Voice — Mozilla's crowd-sourced multilingual ASR corpus covering 100+ languages; size, quality, and demographics vary widely by language.
DiPCo (Dinner Party Corpus) — English far-field conversational corpus of 10 dinner-party sessions (4 participants each, 15-45 minutes per session) recorded with one close-talk microphone plus five 7-mic far-field array devices, designed for noise-robust distant ASR and diarization.
Earnings-21 — 39 hours of 44 English-language earnings calls from 2020 across nine financial sectors, professionally transcribed by Rev.com for benchmarking ASR on named-entity-dense speech.
Earnings-22 — 119 h benchmark of real-world English corporate earnings calls featuring diverse global accents across many countries.
FLEURS — Few-shot Learning Evaluation of Universal Representations of Speech: a standardised ASR/LID evaluation set covering 102 languages, derived from the FLoRes-101 text corpus.
GigaSpeech — 10,000-hour multi-domain English ASR corpus spanning audiobooks, podcasts, and YouTube.
GigaSpeech 2 — Large-scale multi-domain ASR for low-resource Southeast Asian languages (Thai, Indonesian, Vietnamese), built by automated YouTube crawling and transcription (~30k h raw, ~22k h refined).
Golos — ~1,240 h of manually annotated open Russian speech split between crowd-sourced (~1,106 h) and far-field/smart-device (~134 h) recordings.
KeSpeech — 1,542 h from 27,237 speakers across 34 cities, covering standard Mandarin and its 8 subdialects with transcription, speaker, and subdialect labels.
KsponSpeech — ~969 h of Korean spontaneous open-domain dialogue from ~2,000 native speakers, with dual orthographic + pronunciation transcription.
Libri-Light — ~60k h of unlabelled English read speech from LibriVox audiobooks, with small labelled subsets (10 h, 1 h, 10 min) for limited-supervision ASR.
LibriSpeech — Read English audiobooks; the standard English ASR benchmark.
MagicData-RAMC (Rich Annotated Mandarin Conversational) — 180 hours of Mandarin two-party conversational telephone-style speech from 663 speakers across Chinese accent regions, with speaker-turn and topic annotations spanning daily-life to technology topics.
MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition) — 1200 hours of lightly supervised Arabic broadcast speech (conversations, interviews, and reports) from 19 Al Jazeera Arabic TV programmes (2005-2015), with multi-dialect coverage.
Multilingual LibriSpeech — 50,000-hour multilingual audiobook ASR corpus derived from LibriVox recordings covering 8 languages (English, German, Dutch, French, Spanish, Italian, Portuguese, Polish).
MyST Children's Conversational Speech — ~470 hours of English conversational speech from 1,371 students in grades 3-5 interacting with a virtual science tutor across eight FOSS-curriculum science topics, produced by Boulder Learning.
IMDA National Speech Corpus (NSC) — Large-scale Singapore-English speech corpus with ~2,000 hours of orthographically transcribed read speech plus ~1,000 hours of conversational speech, designed for ASR research on Singapore-accented English.
People's Speech — 30,000-hour English ASR corpus assembled from diverse internet sources including radio broadcasts, court hearings, and conferences.
ReazonSpeech — ~35,000 h open Japanese speech corpus collected from terrestrial TV broadcast streams with aligned Japanese transcriptions.
SEAME (Mandarin-English Code-Switching Speech Corpus) — ~192 hours of spontaneous Mandarin-English code-switching conversations and interviews from 156 Singaporean and Malaysian speakers on everyday topics.
Shrutilipi — A 6400+ hour labelled ASR corpus across 12 Indian languages mined from All India Radio news bulletins by AI4Bharat, with document-level audio-text alignment.
SLUE (Spoken Language Understanding Evaluation) — English SLU benchmark on natural (not read) speech. Phase-1 adds ASR, named-entity recognition, and sentiment annotations over subsets of VoxPopuli and VoxCeleb; Phase-2 adds dialog act classification, QA, summarization, and named-entity localization.
SPGISpeech — 5,000 h of professionally transcribed English company earnings-call audio, fully formatted with punctuation and capitalization.
Switchboard-1 Release 2 — ~2,400 two-sided spontaneous English telephone conversations among 543 US speakers (~260 h), separated into two channels.
TED-LIUM 3 — 452-hour English ASR corpus of TED talks with manual and automatic transcriptions; suitable for lecture/talk domain ASR research.
THCHS-30 (Tsinghua Chinese 30-hour Database) — A 30-hour Mandarin read-speech corpus from CSLT, Tsinghua University, recorded at 16 kHz from 50 speakers in a quiet office, with word-, syllable-, and phone-level transcriptions.
TIMIT Acoustic-Phonetic Continuous Speech Corpus — 630 American English speakers across 8 dialect regions, each reading 10 phonetically rich sentences, with time-aligned phonetic and word transcriptions.
VoxForge — Community-contributed crowdsourced corpus of transcribed read speech collected to build free, open acoustic models for open-source ASR engines.
VoxPopuli — Multilingual corpus from 2009-2020 European Parliament recordings: a large unlabelled set across 23 languages plus transcribed speech and aligned interpretations.
WenetSpeech — 10,000-hour large-scale Mandarin ASR corpus collected from YouTube and podcasts with automatic labelling.

augmentation

MUSAN — 109-hour corpus of music, speech, and environmental noise designed for data augmentation in speech and speaker recognition experiments.

emotion

CREMA-D — 7,442 acted audio-visual emotional clips from 91 demographically diverse actors speaking 12 sentences in 6 emotions at 4 intensity levels.
Emotional Speech Database (ESD) — >29 h of parallel emotional speech from 20 speakers (10 English, 10 Mandarin), each reading 350 parallel utterances across 5 emotions.
Expresso — High-quality multi-speaker English expressive speech at 48 kHz (11 h read + 30 h improvised) across many spontaneous expressive styles, for expressive speech resynthesis.
IEMOCAP — ~12 h of acted audio-visual dyadic interactions from 10 actors (scripted and improvised), with categorical and dimensional (valence/activation/dominance) emotion labels.
MELD (Multimodal EmotionLines Dataset) — Multimodal (audio, video, text) emotion recognition corpus of ~13k utterances from ~1.4k multi-party dialogues sampled from the Friends TV series, labelled with seven emotions and three-way sentiment.
MSP-IMPROV — Acted dyadic emotional speech corpus from UT Dallas with 12 actors across six dyad sessions producing 8,438 speaking turns (652 target sentences) labelled for happiness, sadness, anger, and neutral.
MSP-Podcast — Large-scale naturalistic emotional speech mined from Creative-Commons podcasts, multi-rater annotated with categorical emotions and valence/arousal/dominance attributes.
RAVDESS — Acted emotional speech and song from 24 professional actors across 8 emotions at two intensity levels (1,440 speech audio files).

multilingual

Common Voice — Mozilla's crowd-sourced multilingual ASR corpus covering 100+ languages; size, quality, and demographics vary widely by language.
Emilia — Large-scale multilingual in-the-wild speech dataset designed for expressive and diverse TTS training, covering 6 languages.
FLEURS — Few-shot Learning Evaluation of Universal Representations of Speech: a standardised ASR/LID evaluation set covering 102 languages, derived from the FLoRes-101 text corpus.
GigaSpeech 2 — Large-scale multi-domain ASR for low-resource Southeast Asian languages (Thai, Indonesian, Vietnamese), built by automated YouTube crawling and transcription (~30k h raw, ~22k h refined).
KeSpeech — 1,542 h from 27,237 speakers across 34 cities, covering standard Mandarin and its 8 subdialects with transcription, speaker, and subdialect labels.
MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition) — 1200 hours of lightly supervised Arabic broadcast speech (conversations, interviews, and reports) from 19 Al Jazeera Arabic TV programmes (2005-2015), with multi-dialect coverage.
Multilingual LibriSpeech — 50,000-hour multilingual audiobook ASR corpus derived from LibriVox recordings covering 8 languages (English, German, Dutch, French, Spanish, Italian, Portuguese, Polish).
SEAME (Mandarin-English Code-Switching Speech Corpus) — ~192 hours of spontaneous Mandarin-English code-switching conversations and interviews from 156 Singaporean and Malaysian speakers on everyday topics.
Shrutilipi — A 6400+ hour labelled ASR corpus across 12 Indian languages mined from All India Radio news bulletins by AI4Bharat, with document-level audio-text alignment.
VoxForge — Community-contributed crowdsourced corpus of transcribed read speech collected to build free, open acoustic models for open-source ASR engines.
VoxLingua107 — ~6,628 h across 107 languages of short segments automatically extracted from YouTube and labelled by video metadata, for spoken language ID.
VoxPopuli — Multilingual corpus from 2009-2020 European Parliament recordings: a large unlabelled set across 23 languages plus transcribed speech and aligned interpretations.

speaker

AISHELL-4 — Mandarin conference-meeting corpus recorded in real meeting settings with an 8-channel circular microphone array: 211 sessions with 4-8 speakers each, annotated for transcription and speaker activity.
AMI Meeting Corpus — ~100 h of recorded English meetings with synchronized audio, video, and rich annotations including transcripts and speaker labels.
AVSpeech — A large-scale audio-visual dataset of ~4700 hours of 3-10 second clips drawn from ~290k YouTube videos, each segment featuring a single visible speaker with clean speech, released for the "Looking to Listen at the Cocktail Party" speech-separation work.
CN-Celeb — 1,200-hour multi-genre Mandarin speaker recognition corpus spanning 11 real-world scenarios collected from Chinese celebrities.
CREMA-D — 7,442 acted audio-visual emotional clips from 91 demographically diverse actors speaking 12 sentences in 6 emotions at 4 intensity levels.
DiPCo (Dinner Party Corpus) — English far-field conversational corpus of 10 dinner-party sessions (4 participants each, 15-45 minutes per session) recorded with one close-talk microphone plus five 7-mic far-field array devices, designed for noise-robust distant ASR and diarization.
IEMOCAP — ~12 h of acted audio-visual dyadic interactions from 10 actors (scripted and improvised), with categorical and dimensional (valence/activation/dominance) emotion labels.
MagicData-RAMC (Rich Annotated Mandarin Conversational) — 180 hours of Mandarin two-party conversational telephone-style speech from 663 speakers across Chinese accent regions, with speaker-turn and topic annotations spanning daily-life to technology topics.
MELD (Multimodal EmotionLines Dataset) — Multimodal (audio, video, text) emotion recognition corpus of ~13k utterances from ~1.4k multi-party dialogues sampled from the Friends TV series, labelled with seven emotions and three-way sentiment.
MSP-IMPROV — Acted dyadic emotional speech corpus from UT Dallas with 12 actors across six dyad sessions producing 8,438 speaking turns (652 target sentences) labelled for happiness, sadness, anger, and neutral.
MSP-Podcast — Large-scale naturalistic emotional speech mined from Creative-Commons podcasts, multi-rater annotated with categorical emotions and valence/arousal/dominance attributes.
RAVDESS — Acted emotional speech and song from 24 professional actors across 8 emotions at two intensity levels (1,440 speech audio files).
Switchboard-1 Release 2 — ~2,400 two-sided spontaneous English telephone conversations among 543 US speakers (~260 h), separated into two channels.
VCTK — 44-hour English multi-speaker corpus with 110 speakers covering a wide range of UK and US accents; widely used for multi-speaker TTS and speaker adaptation research.
VoxCeleb1 — Speaker identification/verification corpus of 153,516 utterances from 1,251 celebrities extracted from YouTube interview videos.
VoxCeleb2 — 2,442-hour large-scale speaker recognition corpus with 6,112 celebrities collected from YouTube across many languages.

tts

AISHELL-3 — 85-hour multi-speaker Mandarin TTS corpus with 218 speakers in clean recording conditions; the standard Chinese multi-speaker TTS baseline.
CSS10 — Single-speaker speech datasets for 10 languages built from aligned public-domain LibriVox clips, intended for TTS.
DAPS (Device and Produced Speech) — Professional studio-quality speech with time-aligned recordings of the same speech captured on consumer devices (tablet, smartphone) in real-world environments; 20 speakers.
Emilia — Large-scale multilingual in-the-wild speech dataset designed for expressive and diverse TTS training, covering 6 languages.
Emotional Speech Database (ESD) — >29 h of parallel emotional speech from 20 speakers (10 English, 10 Mandarin), each reading 350 parallel utterances across 5 emotions.
Expresso — High-quality multi-speaker English expressive speech at 48 kHz (11 h read + 30 h improvised) across many spontaneous expressive styles, for expressive speech resynthesis.
Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) — ~291.6 h high-quality English multi-speaker TTS from 10 LibriVox speakers (>=17 h each), 44.1 kHz, with Project Gutenberg text.
JSUT (Japanese speech corpus of Saruwatari-lab, U-Tokyo) — A ~10-hour single-speaker Japanese read-speech corpus designed for end-to-end TTS, covering the main pronunciations of daily-use Japanese characters.
LibriTTS — High-fidelity (24 kHz) read English audiobooks derived from LibriSpeech, with normalised transcriptions; the standard baseline for English TTS.
LibriTTS-R — A sound-quality-restored version of LibriTTS: 585 hours of 24 kHz English read speech from 2,456 speakers, with samples and texts identical to LibriTTS but enhanced via Google's Miipher speech restoration model.
LJSpeech — Single-speaker English TTS corpus (24 h, 13,100 clips) recorded from LibriVox readings; widely used as a single-speaker TTS baseline.
Opencpop — High-quality Mandarin singing-voice synthesis corpus of 100 Chinese pop songs (3,756 utterances) sung by a single female professional vocalist, 44.1 kHz, with phoneme/note boundary and pitch annotations.
Thorsten-Voice (German Neutral TTS) — A free single-speaker German read-speech corpus (~22,668 phrases, 22.05 kHz mono) recorded by Thorsten Müller for open TTS training, with neutral and emotional variants released over multiple years.
VCTK — 44-hour English multi-speaker corpus with 110 speakers covering a wide range of UK and US accents; widely used for multi-speaker TTS and speaker adaptation research.