Dataset Catalog

VoxKitchen only curates and links to publicly available dataset information; it does not host or redistribute data. You are responsible for complying with each dataset's license and for obtaining the data yourself. The recommendations below are guidance to help you decide.

| Dataset | Task | Languages | Hours | License | Access |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | asr | zh | 170 | CC BY-NC-ND 4.0 | recipe |
| AISHELL-2 | asr | zh | 1000 | see source terms | manual |
| AISHELL-3 | tts | zh | 85 | CC BY-NC-ND 4.0 | recipe |
| AISHELL-4 | asr, speaker | zh | 120 | CC BY-SA 4.0 | manual |
| AMI Meeting Corpus | asr, speaker | en | 100 | CC BY 4.0 | manual |
| AVSpeech | asr, speaker | multi | 4700 | see source terms | manual |
| CN-Celeb | speaker | zh | 1200 | see source terms | recipe |
| Common Voice | asr, multilingual | multi | | CC0 1.0 | recipe |
| CREMA-D | emotion, speaker | en | | ODbL 1.0 (database) + DbCL 1.0 (contents) | manual |
| CSS10 | tts | de, el, es, fi, fr, hu, ja, nl, ru, zh | 99 | see source terms | manual |
| DAPS (Device and Produced Speech) | tts | en | 4.5 | CC BY-NC 4.0 | manual |
| DiPCo (Dinner Party Corpus) | asr, speaker | en | | CDLA-Permissive-1.0 | manual |
| Earnings-21 | asr | en | 39 | CC BY-SA 4.0 | manual |
| Earnings-22 | asr | en | 119 | CC BY-SA 4.0 | manual |
| Emilia | tts, multilingual | multi | | see source terms | manual |
| Emotional Speech Database (ESD) | tts, emotion | en, zh | 29 | see source terms | manual |
| Expresso | tts, emotion | en | 40 | CC BY-NC 4.0 | manual |
| FLEURS | asr, multilingual | multi | | CC BY 4.0 | recipe |
| GigaSpeech | asr | en | 10000 | see source terms | manual |
| GigaSpeech 2 | asr, multilingual | th, id, vi | 30000 | see source terms | manual |
| Golos | asr | ru | 1240 | see source terms | manual |
| Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) | tts | en | 292 | CC BY 4.0 | recipe |
| IEMOCAP | emotion, speaker | en | 12 | see source terms | manual |
| JSUT (Japanese speech corpus of Saruwatari-lab, U-Tokyo) | tts | ja | 10 | see source terms | manual |
| KeSpeech | asr, multilingual | zh | 1542 | see source terms | manual |
| KsponSpeech | asr | ko | 969 | see source terms | manual |
| Libri-Light | asr | en | 60000 | public domain (LibriVox) | manual |
| LibriSpeech | asr | en | 960 | CC BY 4.0 | recipe |
| LibriTTS | tts | en | 585 | CC BY 4.0 | recipe |
| LibriTTS-R | tts | en | 585 | CC BY 4.0 | recipe |
| LJSpeech | tts | en | 24 | Public Domain | recipe |
| MagicData-RAMC (Rich Annotated Mandarin Conversational) | asr, speaker | zh | 180 | see source terms | manual |
| MELD (Multimodal EmotionLines Dataset) | emotion, speaker | en | 13 | GPL-3.0 | manual |
| MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition) | asr, multilingual | ar | 1200 | see source terms | manual |
| Multilingual LibriSpeech | asr, multilingual | multi | 50000 | CC BY 4.0 | manual |
| MSP-IMPROV | emotion, speaker | en | | see source terms | manual |
| MSP-Podcast | emotion, speaker | en | | see source terms | manual |
| MUSAN | augmentation | en | 109 | CC BY 4.0 | recipe |
| MyST Children's Conversational Speech | asr | en | 470 | see source terms | manual |
| IMDA National Speech Corpus (NSC) | asr | en | 3000 | see source terms | manual |
| Opencpop | tts | zh | 5.2 | CC BY-NC-ND 4.0 | manual |
| People's Speech | asr | en | 30000 | CC BY-SA 4.0 | manual |
| RAVDESS | emotion, speaker | en | | CC BY-NC-SA 4.0 | manual |
| ReazonSpeech | asr | ja | 35000 | CDLA-Sharing-1.0 | manual |
| SEAME (Mandarin-English Code-Switching Speech Corpus) | asr, multilingual | multi | 192 | see source terms | manual |
| Shrutilipi | asr, multilingual | multi | 6400 | CC BY 4.0 | manual |
| SLUE (Spoken Language Understanding Evaluation) | asr | en | 27.3 | see source terms | manual |
| SPGISpeech | asr | en | 5000 | see source terms | manual |
| Switchboard-1 Release 2 | asr, speaker | en | 260 | see source terms | manual |
| TED-LIUM 3 | asr | en | 452 | CC BY-NC-ND 3.0 | manual |
| THCHS-30 (Tsinghua Chinese 30-hour Database) | asr | zh | 30 | Apache-2.0 | recipe |
| Thorsten-Voice (German Neutral TTS) | tts | de | 23 | CC0-1.0 | recipe |
| TIMIT Acoustic-Phonetic Continuous Speech Corpus | asr | en | 5 | see source terms | manual |
| VCTK | tts, speaker | en | 44 | CC BY 4.0 | manual |
| VoxCeleb1 | speaker | multi | 352 | CC BY-SA 4.0 | manual |
| VoxCeleb2 | speaker | multi | 2442 | see source terms | manual |
| VoxForge | asr, multilingual | multi | | GNU GPL | manual |
| VoxLingua107 | multilingual | multi | 6628 | CC BY 4.0 | manual |
| VoxPopuli | asr, multilingual | multi | | CC0 | manual |
| WenetSpeech | asr | zh | 10000 | see source terms | manual |
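
If you want to shortlist datasets programmatically, the table columns map naturally onto a small record type. The following is a minimal sketch, not part of VoxKitchen: `CatalogEntry`, `find`, and `needs_license_check` are hypothetical names chosen for illustration, and the handful of rows are copied by hand from the table above. Hours are stored as `None` where the catalog lists no figure, and the Access values are kept as the plain strings shown in the table.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog row (hypothetical structure, mirroring the table columns)."""
    name: str
    tasks: Tuple[str, ...]
    languages: Tuple[str, ...]
    hours: Optional[float]  # None where the catalog lists no hours figure
    license: str
    access: str  # the Access column value, e.g. "recipe" or "manual"


# A few rows copied by hand from the catalog table (illustrative subset only).
CATALOG: List[CatalogEntry] = [
    CatalogEntry("AISHELL-1", ("asr",), ("zh",), 170, "CC BY-NC-ND 4.0", "recipe"),
    CatalogEntry("Common Voice", ("asr", "multilingual"), ("multi",), None, "CC0 1.0", "recipe"),
    CatalogEntry("LibriSpeech", ("asr",), ("en",), 960, "CC BY 4.0", "recipe"),
    CatalogEntry("LJSpeech", ("tts",), ("en",), 24, "Public Domain", "recipe"),
    CatalogEntry("GigaSpeech", ("asr",), ("en",), 10000, "see source terms", "manual"),
    CatalogEntry("VCTK", ("tts", "speaker"), ("en",), 44, "CC BY 4.0", "manual"),
]


def find(entries, task, access=None, min_hours=None):
    """Return entries tagged with `task`, optionally filtered by access and size."""
    result = []
    for entry in entries:
        if task not in entry.tasks:
            continue
        if access is not None and entry.access != access:
            continue
        # Entries without an hours figure cannot satisfy a minimum-size filter.
        if min_hours is not None and (entry.hours is None or entry.hours < min_hours):
            continue
        result.append(entry)
    return result


def needs_license_check(entry):
    """True when the catalog defers to the dataset's own terms."""
    return entry.license == "see source terms"


if __name__ == "__main__":
    for entry in find(CATALOG, "asr", access="recipe"):
        print(entry.name, entry.license)
```

Filtering on the license string only flags rows where the catalog defers to the source; it is no substitute for reading each dataset's actual terms.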

Browse by task

asr

  • AISHELL-1 — 170-hour open Mandarin speech corpus recorded in clean studio conditions; the standard Chinese ASR benchmark.
  • AISHELL-2 — 1,000 hours of clean Mandarin read speech from 1,991 speakers covering entertainment, finance, technology, sports, and place-of-interest commands, recorded over iOS, Android, and microphone channels.
  • AISHELL-4 — Real-recorded Mandarin conference-meeting corpus (8-channel circular mic array), 211 sessions with 4-8 speakers each, annotated for transcription and speaker activity.
  • AMI Meeting Corpus — ~100 h of recorded English meetings with synchronized audio, video, and rich annotations including transcripts and speaker labels.
  • AVSpeech — A large-scale audio-visual dataset of ~4700 hours of 3-10 second clips drawn from ~290k YouTube videos, each segment featuring a single visible speaker with clean speech, released for the "Looking to Listen at the Cocktail Party" speech-separation work.
  • Common Voice — Mozilla's crowd-sourced multilingual ASR corpus covering 100+ languages; size, quality, and demographics vary widely by language.
  • DiPCo (Dinner Party Corpus) — English far-field conversational corpus of 10 dinner-party sessions (4 participants each, 15-45 minutes per session) recorded with one close-talk microphone plus five 7-mic far-field array devices, designed for noise-robust distant ASR and diarization.
  • Earnings-21 — 39 hours of 44 English-language earnings calls from 2020 across nine financial sectors, professionally transcribed by Rev.com for benchmarking ASR on named-entity-dense speech.
  • Earnings-22 — 119 h benchmark of real-world English corporate earnings calls featuring diverse global accents across many countries.
  • FLEURS — Few-shot Learning Evaluation of Universal Representations of Speech: a standardised ASR/LID evaluation set covering 102 languages, derived from the FLoRes-101 text corpus.
  • GigaSpeech — 10,000-hour multi-domain English ASR corpus spanning audiobooks, podcasts, and YouTube.
  • GigaSpeech 2 — Large-scale multi-domain ASR corpus for low-resource Southeast Asian languages (Thai, Indonesian, Vietnamese), built by automated YouTube crawling and transcription (~30k h raw, ~22k h refined).
  • Golos — ~1,240 h of manually annotated open Russian speech, split between crowd-sourced (~1,106 h) and far-field/smart-device (~134 h) recordings.
  • KeSpeech — 1,542 h from 27,237 speakers across 34 cities, covering standard Mandarin and its 8 subdialects with transcription, speaker, and subdialect labels.
  • KsponSpeech — ~969 h of Korean spontaneous open-domain dialogue from ~2,000 native speakers, with dual orthographic + pronunciation transcription.
  • Libri-Light — ~60k h of unlabelled English read speech from LibriVox audiobooks, with small labelled subsets (10h, 1h, 10min) for limited-supervision ASR.
  • LibriSpeech — Read English audiobooks; the standard English ASR benchmark.
  • MagicData-RAMC (Rich Annotated Mandarin Conversational) — 180 hours of Mandarin two-party conversational telephone-style speech from 663 speakers across Chinese accent regions, with speaker-turn and topic annotations spanning daily-life to technology topics.
  • MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition) — 1,200 hours of lightly supervised Arabic broadcast speech (conversations, interviews, and reports) from 19 Al Jazeera Arabic TV programmes (2005-2015), with multi-dialect coverage.
  • Multilingual LibriSpeech — 50,000-hour multilingual audiobook ASR corpus derived from LibriVox recordings covering 8 languages (English, German, Dutch, French, Spanish, Italian, Portuguese, Polish).
  • MyST Children's Conversational Speech — ~470 hours of English conversational speech from 1371 students in grades 3-5 interacting with a virtual science tutor across eight FOSS-curriculum science topics, produced by Boulder Learning.
  • IMDA National Speech Corpus (NSC) — Large-scale Singapore-English speech corpus from IMDA: ~2,000 hours of orthographically transcribed read speech plus ~1,000 hours of conversational speech, designed for ASR research on Singapore-accented English.
  • People's Speech — 30,000-hour English ASR corpus assembled from diverse internet sources including radio broadcasts, court hearings, and conferences.
  • ReazonSpeech — ~35,000 h open Japanese speech corpus collected from terrestrial TV broadcast streams with aligned Japanese transcriptions.
  • SEAME (Mandarin-English Code-Switching Speech Corpus) — ~192 hours of spontaneous Mandarin-English code-switching conversations and interviews from 156 Singaporean and Malaysian speakers on everyday topics.
  • Shrutilipi — A 6400+ hour labelled ASR corpus across 12 Indian languages mined from All India Radio news bulletins by AI4Bharat, with document-level audio-text alignment.
  • SLUE (Spoken Language Understanding Evaluation) — English SLU benchmark on natural (not read) speech. Phase-1 adds ASR, named-entity recognition, and sentiment annotations over subsets of VoxPopuli and VoxCeleb; Phase-2 adds dialog act classification, QA, summarization, and named-entity localization.
  • SPGISpeech — 5,000 h of professionally transcribed English company earnings-call audio, fully formatted with punctuation and capitalization.
  • Switchboard-1 Release 2 — ~2,400 two-sided spontaneous English telephone conversations among 543 US speakers (~260 h), separated into two channels.
  • TED-LIUM 3 — 452-hour English ASR corpus of TED talks with manual and automatic transcriptions; suitable for lecture/talk domain ASR research.
  • THCHS-30 (Tsinghua Chinese 30-hour Database) — A 30-hour, 16 kHz Mandarin read-speech corpus from CSLT Tsinghua, recorded by 50 speakers in a quiet office, with word-, syllable-, and phone-level transcriptions.
  • TIMIT Acoustic-Phonetic Continuous Speech Corpus — 630 American English speakers across 8 dialect regions, each reading 10 phonetically rich sentences, with time-aligned phonetic and word transcriptions.
  • VoxForge — Community-contributed crowdsourced corpus of transcribed read speech collected to build free, open acoustic models for open-source ASR engines.
  • VoxPopuli — Multilingual corpus from 2009-2020 European Parliament recordings: a large unlabelled set across 23 languages plus transcribed speech and aligned interpretations.
  • WenetSpeech — 10,000-hour large-scale Mandarin ASR corpus collected from YouTube and podcasts with automatic labelling.

augmentation

  • MUSAN — 109-hour corpus of music, speech, and environmental noise designed for data augmentation in speech and speaker recognition experiments.

emotion

  • CREMA-D — 7,442 acted audio-visual emotional clips from 91 demographically diverse actors speaking 12 sentences in 6 emotions at 4 intensity levels.
  • Emotional Speech Database (ESD) — >29 h of parallel emotional speech from 20 speakers (10 English, 10 Mandarin), each reading 350 parallel utterances across 5 emotions.
  • Expresso — High-quality multi-speaker English expressive speech at 48 kHz (11 h read + 30 h improvised) across many spontaneous expressive styles, for expressive speech resynthesis.
  • IEMOCAP — ~12 h of acted audio-visual dyadic interactions from 10 actors (scripted and improvised), with categorical and dimensional (valence/activation/dominance) emotion labels.
  • MELD (Multimodal EmotionLines Dataset) — Multimodal (audio, video, text) emotion recognition corpus of ~13k utterances from ~1.4k multi-party dialogues sampled from the Friends TV series, labelled with seven emotions and three-way sentiment.
  • MSP-IMPROV — Acted dyadic emotional speech corpus from UT Dallas with 12 actors across six dyad sessions producing 8438 speaking turns (652 target sentences) labelled for happiness, sadness, anger, and neutral.
  • MSP-Podcast — Large-scale naturalistic emotional speech mined from Creative-Commons podcasts, multi-rater annotated with categorical emotions and valence/arousal/dominance attributes.
  • RAVDESS — Acted emotional speech and song from 24 professional actors across 8 emotions at two intensity levels (1,440 speech audio files).

multilingual

  • Common Voice — Mozilla's crowd-sourced multilingual ASR corpus covering 100+ languages; size, quality, and demographics vary widely by language.
  • Emilia — Large-scale multilingual in-the-wild speech dataset designed for expressive and diverse TTS training, covering 6 languages.
  • FLEURS — Few-shot Learning Evaluation of Universal Representations of Speech: a standardised ASR/LID evaluation set covering 102 languages, derived from the FLoRes-101 text corpus.
  • GigaSpeech 2 — Large-scale multi-domain ASR corpus for low-resource Southeast Asian languages (Thai, Indonesian, Vietnamese), built by automated YouTube crawling and transcription (~30k h raw, ~22k h refined).
  • KeSpeech — 1,542 h from 27,237 speakers across 34 cities, covering standard Mandarin and its 8 subdialects with transcription, speaker, and subdialect labels.
  • MGB-2 Challenge (Arabic Multi-Dialect Broadcast Media Recognition) — 1,200 hours of lightly supervised Arabic broadcast speech (conversations, interviews, and reports) from 19 Al Jazeera Arabic TV programmes (2005-2015), with multi-dialect coverage.
  • Multilingual LibriSpeech — 50,000-hour multilingual audiobook ASR corpus derived from LibriVox recordings covering 8 languages (English, German, Dutch, French, Spanish, Italian, Portuguese, Polish).
  • SEAME (Mandarin-English Code-Switching Speech Corpus) — ~192 hours of spontaneous Mandarin-English code-switching conversations and interviews from 156 Singaporean and Malaysian speakers on everyday topics.
  • Shrutilipi — A 6400+ hour labelled ASR corpus across 12 Indian languages mined from All India Radio news bulletins by AI4Bharat, with document-level audio-text alignment.
  • VoxForge — Community-contributed crowdsourced corpus of transcribed read speech collected to build free, open acoustic models for open-source ASR engines.
  • VoxLingua107 — ~6,628 h across 107 languages of short segments automatically extracted from YouTube and labelled by video metadata, for spoken language ID.
  • VoxPopuli — Multilingual corpus from 2009-2020 European Parliament recordings: a large unlabelled set across 23 languages plus transcribed speech and aligned interpretations.

speaker

  • AISHELL-4 — Real-recorded Mandarin conference-meeting corpus (8-channel circular mic array), 211 sessions with 4-8 speakers each, annotated for transcription and speaker activity.
  • AMI Meeting Corpus — ~100 h of recorded English meetings with synchronized audio, video, and rich annotations including transcripts and speaker labels.
  • AVSpeech — A large-scale audio-visual dataset of ~4700 hours of 3-10 second clips drawn from ~290k YouTube videos, each segment featuring a single visible speaker with clean speech, released for the "Looking to Listen at the Cocktail Party" speech-separation work.
  • CN-Celeb — 1,200-hour multi-genre Mandarin speaker recognition corpus spanning 11 real-world scenarios collected from Chinese celebrities.
  • CREMA-D — 7,442 acted audio-visual emotional clips from 91 demographically diverse actors speaking 12 sentences in 6 emotions at 4 intensity levels.
  • DiPCo (Dinner Party Corpus) — English far-field conversational corpus of 10 dinner-party sessions (4 participants each, 15-45 minutes per session) recorded with one close-talk microphone plus five 7-mic far-field array devices, designed for noise-robust distant ASR and diarization.
  • IEMOCAP — ~12 h of acted audio-visual dyadic interactions from 10 actors (scripted and improvised), with categorical and dimensional (valence/activation/dominance) emotion labels.
  • MagicData-RAMC (Rich Annotated Mandarin Conversational) — 180 hours of Mandarin two-party conversational telephone-style speech from 663 speakers across Chinese accent regions, with speaker-turn and topic annotations spanning daily-life to technology topics.
  • MELD (Multimodal EmotionLines Dataset) — Multimodal (audio, video, text) emotion recognition corpus of ~13k utterances from ~1.4k multi-party dialogues sampled from the Friends TV series, labelled with seven emotions and three-way sentiment.
  • MSP-IMPROV — Acted dyadic emotional speech corpus from UT Dallas with 12 actors across six dyad sessions producing 8438 speaking turns (652 target sentences) labelled for happiness, sadness, anger, and neutral.
  • MSP-Podcast — Large-scale naturalistic emotional speech mined from Creative-Commons podcasts, multi-rater annotated with categorical emotions and valence/arousal/dominance attributes.
  • RAVDESS — Acted emotional speech and song from 24 professional actors across 8 emotions at two intensity levels (1,440 speech audio files).
  • Switchboard-1 Release 2 — ~2,400 two-sided spontaneous English telephone conversations among 543 US speakers (~260 h), separated into two channels.
  • VCTK — 44-hour English multi-speaker corpus with 110 speakers covering a wide range of UK and US accents; widely used for multi-speaker TTS and speaker adaptation research.
  • VoxCeleb1 — Speaker identification/verification corpus of 153,516 utterances from 1251 celebrities extracted from YouTube interview videos.
  • VoxCeleb2 — 2,442-hour large-scale speaker recognition corpus with 6,112 celebrities collected from YouTube across many languages.

tts

  • AISHELL-3 — 85-hour multi-speaker Mandarin TTS corpus with 218 speakers in clean recording conditions; the standard Chinese multi-speaker TTS baseline.
  • CSS10 — Single-speaker speech datasets for 10 languages built from aligned public-domain LibriVox clips, intended for TTS.
  • DAPS (Device and Produced Speech) — Professional studio-quality speech with time-aligned recordings of the same speech captured on consumer devices (tablet, smartphone) in real-world environments; 20 speakers.
  • Emilia — Large-scale multilingual in-the-wild speech dataset designed for expressive and diverse TTS training, covering 6 languages.
  • Emotional Speech Database (ESD) — >29 h of parallel emotional speech from 20 speakers (10 English, 10 Mandarin), each reading 350 parallel utterances across 5 emotions.
  • Expresso — High-quality multi-speaker English expressive speech at 48 kHz (11 h read + 30 h improvised) across many spontaneous expressive styles, for expressive speech resynthesis.
  • Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) — ~291.6 h high-quality English multi-speaker TTS from 10 LibriVox speakers (>=17 h each), 44.1 kHz, with Project Gutenberg text.
  • JSUT (Japanese speech corpus of Saruwatari-lab, U-Tokyo) — A ~10-hour single-speaker Japanese read-speech corpus designed for end-to-end TTS, covering the main pronunciations of daily-use Japanese characters.
  • LibriTTS — High-fidelity (24 kHz) read English audiobooks derived from LibriSpeech, with normalised transcriptions; the standard baseline for English TTS.
  • LibriTTS-R — A sound-quality-restored version of LibriTTS: 585 hours of 24 kHz English read speech from 2,456 speakers, with samples and texts identical to LibriTTS but enhanced via Google's Miipher speech restoration model.
  • LJSpeech — Single-speaker English TTS corpus (24 h, 13,100 clips) drawn from LibriVox readings; widely used as a single-speaker TTS baseline.
  • Opencpop — High-quality Mandarin singing-voice synthesis corpus of 100 popular Chinese pop songs (3756 utterances) sung by a single female professional vocalist, 44.1 kHz, with phoneme/note boundary and pitch annotations.
  • Thorsten-Voice (German Neutral TTS) — A free single-speaker German read-speech corpus (~22,668 phrases, 22.05 kHz mono) recorded by Thorsten Müller for open TTS training, with neutral and emotional variants released over multiple years.
  • VCTK — 44-hour English multi-speaker corpus with 110 speakers covering a wide range of UK and US accents; widely used for multi-speaker TTS and speaker adaptation research.