Getting Started

VoxKitchen's recommended user path is Docker-first: install the lightweight vkit launcher locally, then run pipelines inside prebuilt Docker images. You do not need to install model dependencies on your host machine.

Install The Launcher

Requirements:

  • Docker
  • Python 3.10+ for the lightweight vkit CLI

pipx install voxkitchen      # recommended: isolates the launcher
# or
pip install voxkitchen

This installs only the lightweight launcher and inspection commands (a few MB, no torch / ASR / TTS dependencies). All pipeline runtime dependencies stay inside the prebuilt Docker images.
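Before installing, it can help to confirm that both requirements are met. The following is a minimal sketch, not part of vkit: the function names python_version_ok and check_requirements are illustrative, and the check assumes Docker and Python are reachable on PATH as docker and python3.

```shell
# Preflight sketch for the two requirements listed above.
# These helper names are illustrative; vkit does not provide them.

# Succeeds when a dotted version such as "3.11" or "3.12.1" is at least 3.10.
python_version_ok() {
    pv_major=${1%%.*}
    case $1 in
        *.*) pv_minor=${1#*.}; pv_minor=${pv_minor%%.*} ;;
        *)   pv_minor=0 ;;
    esac
    [ "$pv_major" -gt 3 ] || { [ "$pv_major" -eq 3 ] && [ "$pv_minor" -ge 10 ]; }
}

# Prints one status line per requirement; returns non-zero if any check fails.
check_requirements() {
    cr_status=0
    if command -v docker >/dev/null 2>&1; then
        echo "docker: found"
    else
        echo "docker: missing"
        cr_status=1
    fi
    if command -v python3 >/dev/null 2>&1; then
        cr_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
        if python_version_ok "$cr_version"; then
            echo "python3: $cr_version ok"
        else
            echo "python3: $cr_version is older than 3.10"
            cr_status=1
        fi
    else
        echo "python3: missing"
        cr_status=1
    fi
    return "$cr_status"
}
```

Calling check_requirements prints a short report and returns a non-zero status when something is missing, so it can gate an install script.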

Pull A Runtime Image

Start with the slim image for the demo. For your own pipelines, pick the smallest image that contains the operators you use. Mixed pipelines may need latest.

vkit docker pull --tag slim

Tag           Use when
slim          CPU-friendly cleaning, VAD, quality metrics, packing
asr           Faster-Whisper, FunASR, Qwen3-ASR, forced alignment
diarize       Pyannote speaker diarization
tts           Kokoro, ChatTTS, CosyVoice
fish-speech   Fish-Speech isolated runtime
latest        Mixed pipelines across ASR, diarization, TTS, Fish-Speech, and core operators

Not sure which image your YAML needs? Run vkit validate pipeline.yaml; it prints the recommended vkit docker pull --tag ... command and the matching run command.

Command flags and tag behavior are listed in the CLI reference.

Run The Demo

The published image includes example pipelines and demo audio, so no repository clone is required for this quick start. Start with the slim demo; use latest later for pipelines that mix ASR, diarization, and TTS operators.

vkit docker run --tag slim examples/pipelines/demo-no-asr.yaml --dry-run
vkit docker run --tag slim examples/pipelines/demo-no-asr.yaml
vkit inspect run ./work/demo-no-asr

Create Your First Project

Use a template, put audio under data/, validate the plan, then run in Docker.

vkit init my-project --template asr
cd my-project

cp /path/to/your/audio/*.wav data/

vkit docker run --tag asr pipeline.yaml --dry-run
vkit docker run --tag asr pipeline.yaml
vkit inspect run work/
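The dry run and the real run can be chained so that the real run starts only when the dry run succeeds. This is a sketch under two assumptions: that a failed dry run exits with a non-zero status, and that a small wrapper is acceptable. The run_checked function and the VKIT override variable are hypothetical helpers, not vkit features.

```shell
# Sketch: start the real run only if the dry run succeeds.
# VKIT is a hypothetical override so the wrapper can be tried with a stub;
# it defaults to the real launcher.
VKIT=${VKIT:-vkit}

# Usage: run_checked <tag> <pipeline.yaml>
run_checked() {
    rc_tag=$1
    rc_pipeline=$2
    # Assumption: a dry run that finds a problem exits non-zero.
    "$VKIT" docker run --tag "$rc_tag" "$rc_pipeline" --dry-run || {
        echo "dry run failed; not starting the real run" >&2
        return 1
    }
    "$VKIT" docker run --tag "$rc_tag" "$rc_pipeline"
}
```

For the project above, this would be invoked as run_checked asr pipeline.yaml from inside my-project.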

Available templates:

Template   Use case                                          Suggested image
cleaning   Quality metrics, dedup, filtering                 slim
asr        VAD, augmentation, ASR labeling, packing          asr
speaker    Diarization, embeddings, language/gender labels   latest
tts        Denoise, segment, transcribe, align, pack         asr

See all templates:

vkit init --list-templates

Inspect Results

vkit inspect run work/
vkit inspect cuts <work_dir>/<stage>/cuts.jsonl.gz
vkit inspect errors work/

vkit docker run writes run artifacts under ./work and exported datasets under ./output, both owned by your host user ID. It also mounts ./data automatically when that directory exists.
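To find the files that vkit inspect cuts accepts, the work directory can be searched with standard tools. The sketch below assumes only the <work_dir>/<stage>/cuts.jsonl.gz layout shown above and that each file is gzip-compressed JSON Lines, as the extension suggests. The helper names list_cuts and peek_cuts are illustrative.

```shell
# Sketch: locate and peek at per-stage cuts files with standard tools.
# Assumes the <work_dir>/<stage>/cuts.jsonl.gz layout; helper names are illustrative.

# List every cuts.jsonl.gz under a work directory (default: ./work), sorted by path.
list_cuts() {
    find "${1:-work}" -type f -name 'cuts.jsonl.gz' | sort
}

# Print the first N raw records (default 1) of a gzip-compressed JSONL file.
peek_cuts() {
    gzip -dc "$1" | head -n "${2:-1}"
}
```

Each path printed by list_cuts can be passed to vkit inspect cuts, while peek_cuts shows the raw records without going through the launcher.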

Download A Dataset

Dataset download also runs through Docker, so recipe dependencies stay inside the runtime image and data lands under your project's ./data directory.

vkit init ls-project --template asr
cd ls-project
vkit docker download --tag slim librispeech --root ./data/librispeech --subsets dev-clean
# Edit pipeline.yaml: set ingest.args.root to ./data/librispeech
vkit docker run --tag asr pipeline.yaml

Run vkit recipes to list every available dataset along with its compressed download size. Nine recipes ship today: English / Chinese ASR (librispeech, aishell), single- and multi-speaker TTS (libritts, ljspeech, aishell3), Chinese speaker recognition (cnceleb), multilingual evaluation (fleurs), augmentation (musan), and commonvoice, which requires a manual download. See Dataset Catalog for per-recipe subset details.

Configuration

Some operators require API tokens. Put them in ./.env; vkit docker run passes that file into the container automatically.

cp .env.example .env

Variable   Required by        Notes
HF_TOKEN   pyannote_diarize   Accept the pyannote model agreement on HuggingFace first.
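
A missing token is easier to catch before a run than inside the container. The sketch below assumes .env holds plain KEY=value lines with no quoting or export prefix; the env_has_var helper is illustrative and not part of vkit.

```shell
# Sketch: check that a .env file defines a non-empty variable.
# Assumes plain KEY=value lines; env_has_var is an illustrative helper.
# Usage: env_has_var <file> <NAME>
env_has_var() {
    grep -Eq "^$2=[^[:space:]]" "$1"
}
```

For example, env_has_var .env HF_TOKEN || echo "HF_TOKEN is not set in .env" >&2 warns before a diarization pipeline is started.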

Next Steps