CN-Celeb
A 1,200-hour, multi-genre Mandarin speaker recognition corpus of Chinese celebrity speech, spanning 11 real-world scenarios.
- Task: speaker
- Languages: zh
- Hours: 1200
- Domain: celebrity speech (in-the-wild)
- License: see source terms
- Homepage: https://www.openslr.org/82
- Paper: https://arxiv.org/abs/1911.01799
Recommendation
The primary benchmark for Mandarin speaker verification and identification in realistic, in-the-wild conditions. Its wide acoustic diversity across genres (such as interview, singing, and entertainment) is valuable for testing robustness, but the same diversity makes it a challenging benchmark. Check the source terms carefully: the data is free for research use, but redistribution is restricted.
Getting the data
Downloadable via VoxKitchen (cnceleb, source: openslr, size: 20.7 GB):
vkit docker download --tag slim cnceleb --root ./data/cnceleb
Subsets: cn-celeb_v2.
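Because genre diversity is the main reason to use this corpus, it can help to check the genre balance of the download before training. The following is a minimal Python sketch, not part of VoxKitchen. It assumes that audio files sit somewhere under the download root and that each file name starts with its genre followed by a hyphen (for example interview-01-001.flac). Both the directory layout and the naming scheme are assumptions to verify against your local copy, and count_by_genre is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path

# Assumption: audio is stored as .wav or .flac; adjust for your copy.
AUDIO_EXTENSIONS = {".wav", ".flac"}


def genre_of(path: Path) -> str:
    """Return the text before the first hyphen in the file name.

    Assumes names like 'interview-01-001.flac', where the prefix is the genre.
    """
    return path.stem.split("-", 1)[0]


def count_by_genre(root: str) -> Counter:
    """Count audio files under root, grouped by the assumed genre prefix."""
    counts = Counter()
    root_path = Path(root)
    if not root_path.is_dir():
        return counts  # nothing downloaded yet
    for path in root_path.rglob("*"):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            counts[genre_of(path)] += 1
    return counts


if __name__ == "__main__":
    # Same root as the download command above.
    for genre, n in count_by_genre("./data/cnceleb").most_common():
        print(f"{genre}\t{n}")
```

If the printed prefixes do not look like genre names, the local layout differs from the assumption and genre_of needs to be adapted.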
Suggested processing
A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/speaker-analysis.yaml; run it with vkit docker run.