AVSpeech
A large-scale audio-visual dataset of ~4700 hours of 3-10 second clips drawn from ~290k YouTube videos, each segment featuring a single visible speaker with clean speech, released for the "Looking to Listen at the Cocktail Party" speech-separation work.
- Task: asr, speaker
- Languages: multi
- Hours: 4700
- Domain: youtube audio-visual
- License: see source terms
- Homepage: https://looking-to-listen.github.io/avspeech/
- Paper: https://arxiv.org/abs/1804.03619
Recommendation
Pick for audio-visual speech separation, speaker-conditioned source separation, and lip-sync / talking-face research where clean single-speaker reference segments are needed; useful as a pretraining source for AV speech models. Distributed as CSV segment lists referencing YouTube — expect link rot and YouTube ToS constraints.
Getting the data
Obtain from the dataset homepage.
No transcripts; users must download clips from YouTube themselves under Google research terms + YouTube ToS.
Suggested processing
A recommended VoxKitchen pipeline ships in the repository at voxkitchen/templates/pipelines/speaker-analysis.yaml — run it with vkit docker run.