Building VoxKitchen Docker images
Most users should use vkit docker pull --tag <tag> to fetch the prebuilt
images from
ghcr.io/xqfeng-josie/voxkitchen:{slim,asr,diarize,tts,fish-speech,latest}.
Build locally only if one of these applies:
- Air-gapped / internal registries that cannot reach GHCR
- Apple Silicon / non-linux/amd64 hosts until we ship multi-arch images
- Custom CUDA versions (the published images pin CUDA 12.4)
- Adding private operators or baking in your own models
- Security audits that require a reproducible build from source
Targets
The single docker/Dockerfile exposes six
BuildKit targets:
| Target | What it contains | Approx. size | GPU |
|---|---|---|---|
| slim | core venv only | ~13 GB | no |
| asr | core + asr venv (faster-whisper / funasr / qwen3 / forced align) | ~48 GB | yes |
| diarize | core + diarize venv (pyannote only) | ~32 GB | yes |
| tts | core + tts venv (kokoro / ChatTTS / CosyVoice) | ~44 GB | yes |
| fish-speech | core + fish-speech venv (torch 2.8 stack, S2 model cached) | ~57 GB | yes |
| latest | all 5 envs merged in one image | ~123 GB | yes |
Base image for GPU targets: pytorch/pytorch:2.4.1-cuda12.4. Slim uses
torch==2.4.1+cpu inside the same base image; the CUDA libs are unused
and can be ignored on CPU-only hosts.
The five venvs are isolated:
- core / asr / diarize / tts share a torch 2.4 + numpy 1.x stack. fish-speech has its own torch 2.8 + numpy 2.1 stack (upstream requirement).
- Diarize is its own env (not bundled into asr) so the :diarize image can ship without the full ASR stack for users who only need speaker segmentation.
- Pipelines with a tts_fish_speech stage get that stage dispatched to the fish-speech env via subprocess; other stages stay wherever they belong.
- :latest is assembled via COPY --from=<env> so BuildKit can build each env branch in parallel.
All six targets use the PyTorch CUDA base image. The slim target installs torch-cpu
wheels into its core venv and is runnable without a GPU; the base image's
CUDA libraries are unused there.
Why one image with multiple envs?
A single Python environment cannot host all 52 operators — pyannote.audio>=4.0
wants torch>=2.8 + numpy>=2.1, funasr + modelscope are capped at the
transformers<5 / numpy<2 stack, and the four TTS engines each pull transitive
deps that fight the ASR stack on transformers / ctranslate2 versions.
VoxKitchen already checkpoints every pipeline stage to cuts.jsonl.gz on disk.
Stages don't share in-memory state, so they can run in different Python
interpreters if the parent runner routes them. The latest image ships five
venvs at /opt/voxkitchen/envs/{core,asr,diarize,tts,fish-speech}/ and a
mapping from operator name to env at /opt/voxkitchen/op_env_map.json. The
pipeline runner in the core venv reads that map and spawns stage_runner
in the target venv for cross-env stages. Stages communicate only via the
existing on-disk checkpoints.
Full design: docs/architecture/multi-env.md.
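The routing described above can be sketched roughly as follows. This is an illustration only, not the actual runner: it assumes op_env_map.json is a flat JSON object mapping operator name to env name, and the module name voxkitchen.stage_runner and the --op flag are hypothetical stand-ins for however stage_runner is really invoked.

```python
import json
from pathlib import Path, PurePosixPath
from typing import Optional

# Paths taken from the layout described above.
ENVS_ROOT = PurePosixPath("/opt/voxkitchen/envs")
OP_ENV_MAP = Path("/opt/voxkitchen/op_env_map.json")


def load_op_env_map(path: Path = OP_ENV_MAP) -> dict:
    # Assumed format: {"operator_name": "env_name", ...}
    return json.loads(path.read_text())


def stage_command(op_name: str, op_env_map: dict, current_env: str = "core") -> Optional[list]:
    """Return the argv for a cross-env stage, or None if it can run in-process."""
    target_env = op_env_map.get(op_name, current_env)
    if target_env == current_env:
        return None
    python = ENVS_ROOT / target_env / "bin" / "python"
    # Hypothetical invocation: the real module path and flags may differ.
    return [str(python), "-m", "voxkitchen.stage_runner", "--op", op_name]
```

Because the stages only exchange data through the on-disk checkpoints, the parent only needs to pick the right interpreter; nothing in memory has to cross the process boundary.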
Build commands
docker build --target slim -f docker/Dockerfile -t voxkitchen:slim .
docker build --target latest -f docker/Dockerfile -t voxkitchen:latest .
vkit docker build <target> is the preferred local wrapper. It runs the same
Dockerfile build and defaults Docker client scratch paths to ./.docker:
- DOCKER_CONFIG=./.docker/config
- TMPDIR=./.docker/tmp
- BUILDX_CONFIG=./.docker/buildx
- XDG_CACHE_HOME=./.docker/cache
Override the base directory with VKIT_DOCKER_WORK_DIR:
VKIT_DOCKER_WORK_DIR=/data2/xiaoqinfeng/workdir/VoxKitchen/.docker \
vkit docker build asr
The release script uses the same default before building and pushing images.
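As a rough sketch of how those defaults could be derived (the function name is hypothetical, and it assumes that variables the caller has already exported take precedence over the defaults):

```python
import os
from pathlib import PurePosixPath


def docker_scratch_env(environ=None, cwd=None) -> dict:
    """Compute the four Docker client scratch paths from a base directory."""
    environ = os.environ if environ is None else environ
    work_dir = environ.get("VKIT_DOCKER_WORK_DIR")
    # Default base directory is ./.docker relative to the current directory.
    base = PurePosixPath(work_dir) if work_dir else PurePosixPath(cwd or os.getcwd()) / ".docker"
    defaults = {
        "DOCKER_CONFIG": base / "config",
        "TMPDIR": base / "tmp",
        "BUILDX_CONFIG": base / "buildx",
        "XDG_CACHE_HOME": base / "cache",
    }
    # Assumption: an explicitly exported variable wins over the default.
    return {name: environ.get(name, str(path)) for name, path in defaults.items()}
```

The returned mapping can be merged into the environment passed to the docker subprocess, which keeps the caller's own shell environment unchanged.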
If DOCKER_CONFIG is project-local, log in to GHCR with the same config:
mkdir -p .docker/config
DOCKER_CONFIG="$PWD/.docker/config" docker login ghcr.io
Important: these variables move Docker CLI temp/config/cache files only. Image
layers and BuildKit layer cache are stored by the Docker daemon's data-root,
which is commonly /var/lib/docker and may still fill /. To move that data,
configure Docker daemon storage, for example:
{
"data-root": "/data2/xiaoqinfeng/docker-data"
}
Put that in /etc/docker/daemon.json and restart Docker. Use a path outside
the repository; daemon layer storage is much larger than ./.docker.
Each build ends with a per-env vkit doctor --expect <env> smoke test. If any
expected operator fails to register in its env, the build fails loudly rather
than shipping an image that looks healthy but breaks at runtime.
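The idea behind the smoke test is a simple set difference that turns into a non-zero exit code. A minimal sketch, assuming the expected and registered operator names are available as sets (how vkit doctor actually collects them is not shown here):

```python
import sys


def missing_operators(expected: set, registered: set) -> list:
    """Operators that should be present in this env but did not register."""
    return sorted(expected - registered)


def smoke_test(env: str, expected: set, registered: set) -> int:
    """Return a process exit code: 0 if every expected operator registered, else 1."""
    missing = missing_operators(expected, registered)
    if missing:
        print(f"[{env}] missing operators: {', '.join(missing)}", file=sys.stderr)
        return 1
    return 0
```

Returning the code from a RUN step (for example via sys.exit) is what makes the Docker build stop at the failing env instead of producing an image.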
Faster rebuilds (BuildKit cache + GHCR layer cache)
A full cold build of all six targets takes ~1-2 hours and re-downloads several GB of wheels per env (torch alone is ~2.5 GB per CUDA venv). Two layers of cache make rebuilds — and builds from a fresh machine — much faster.
1. uv wheel cache (always on, no setup)
docker/Dockerfile mounts a BuildKit cache at /root/.cache/uv on
every RUN uv pip install:
RUN --mount=type=cache,id=uv-cache,target=/root/.cache/uv,sharing=locked \
uv pip install --python /opt/voxkitchen/envs/<env>/bin/python ...
The cache is keyed by id=uv-cache and shared across all stages, so
torch 2.4.1+cu124 downloads once and the same wheel is reused by
core / asr / diarize / tts; fish-speech reuses its own torch
2.8 wheel on the next rebuild. The cache mount never lands in a final
image layer — it lives in BuildKit's local cache volume.
This is enabled automatically by Docker's default BuildKit. Both
docker build and docker buildx build honor it without flags.
To inspect or clear the local BuildKit cache:
docker buildx du # bytes on disk per cache type
docker buildx prune --filter type=exec.cachemount # clear uv wheel cache
docker buildx prune # clear everything
2. GHCR registry layer cache (release script only)
The uv cache above lives on a single machine. scripts/release.sh
also publishes BuildKit's layer cache to GHCR so it survives
across machines and across docker buildx prune:
- After every successful target build, the script pushes layer descriptors to ghcr.io/xqfeng-josie/voxkitchen:buildcache-<target> via --cache-to type=registry,mode=max.
- On the next run (on the same machine or any other one logged into GHCR), --cache-from pulls those descriptors and BuildKit reuses unchanged layers without re-running the underlying RUN step.
The buildcache-<target> tags are not runnable images; they
carry only layer manifests. They are public, so anyone can pull from
them as a read-only mirror; pushing requires write:packages on the
GHCR repo.
Registry cache requires the docker-container BuildKit driver — the
default docker driver only supports cache-to type=inline. The
release script creates and reuses a builder named voxkitchen-builder
automatically on first run:
docker buildx create --name voxkitchen-builder --driver docker-container --use
If you want the same registry-cache behavior from a local vkit docker
build (or a direct docker buildx build), build with buildx
yourself. Read-only consumption of the public cache needs no
authentication:
docker buildx create --name vk-builder --driver docker-container --use # one-time
docker buildx build \
--target asr \
--cache-from type=registry,ref=ghcr.io/xqfeng-josie/voxkitchen:buildcache-asr \
--load \
-f docker/Dockerfile -t voxkitchen:asr .
To also write to the cache (for example, you maintain a private
fork and want your CI builds to share state), point --cache-to at a
repo you have write access to:
--cache-to type=registry,ref=ghcr.io/<your-user>/voxkitchen-cache:asr,mode=max
mode=max is important — mode=min only exports the final stage's
layer, which discards most of the savings for multi-stage Dockerfiles
like this one.
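If you script these builds, the flags above can be assembled programmatically. The helper below is a sketch with a hypothetical name; it only uses the docker buildx build flags already shown in this section and tags the result voxkitchen:<target> as in the examples.

```python
def buildx_command(target, cache_from_ref=None, cache_to_ref=None,
                   dockerfile="docker/Dockerfile"):
    """Build the argv for a buildx build with optional registry cache."""
    cmd = ["docker", "buildx", "build", "--target", target]
    if cache_from_ref:
        cmd += ["--cache-from", f"type=registry,ref={cache_from_ref}"]
    if cache_to_ref:
        # mode=max exports intermediate stages too, not just the final one.
        cmd += ["--cache-to", f"type=registry,ref={cache_to_ref},mode=max"]
    cmd += ["--load", "-f", dockerfile, "-t", f"voxkitchen:{target}", "."]
    return cmd
```

For example, buildx_command("asr", cache_from_ref="ghcr.io/xqfeng-josie/voxkitchen:buildcache-asr") reproduces the read-only command above, and the result can be passed to subprocess.run.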
Expected timings (release.sh, single machine)
| Scenario | latest |
|---|---|
| First build, cold cache | ~1-2 h |
| Rebuild after voxkitchen/** source change only | ~5-10 min |
| Rebuild on a fresh machine after GHCR cache pull | ~10-20 min (cache download dominates) |
| Rebuild after pyproject.toml extras change | ~30-60 min (re-runs the affected env's install) |
The single-env targets (slim, asr, diarize, tts, fish-speech)
each take a fraction of latest's time because they build only one
env branch. Use --target slim for the fastest smoke test loop while
hacking on core operators.
When cache misses happen
Cache lookup is keyed by the exact RUN command string plus all
preceding layer hashes. Common things that invalidate downstream
layers:
| Change | Invalidates from |
|---|---|
| pyproject.toml (any line) | COPY pyproject.toml → all later venv installs |
| docker/constraints/<env>.txt | that env's RUN uv pip install and everything after in the stage |
| docker/Dockerfile itself (e.g. add a RUN) | the changed line and everything after in the same stage |
| voxkitchen/** source | COPY . . (line ~124) and the final pip install -e . --no-deps; warmup + schema dump rerun |
| Base image tag (pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime) | everything |
Warmup (model downloads) reruns whenever its RUN step's cache is
invalidated, which currently includes any source change. Treat
warmup as the most expensive step after wheel installs and re-order
Dockerfile edits accordingly if you find yourself paying for it often.
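To see why one change invalidates everything after it, here is a deliberately simplified model of chained cache keys. It is not BuildKit's real algorithm: each key is just a hash of the previous key plus the step string, and file contents for COPY steps are represented by a made-up digest embedded in that string.

```python
import hashlib
from typing import Optional


def layer_keys(steps: list) -> list:
    """Chain a key per step: each key depends on the step and all steps before it."""
    keys = []
    parent = ""
    for step in steps:
        parent = hashlib.sha256(f"{parent}\n{step}".encode()).hexdigest()
        keys.append(parent)
    return keys


def first_invalidated(old_steps: list, new_steps: list) -> Optional[int]:
    """Index of the first step whose key changed, or None if all keys match."""
    old, new = layer_keys(old_steps), layer_keys(new_steps)
    for index, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return index
    if len(old) != len(new):
        return min(len(old), len(new))
    return None
```

In this model a source change alters the COPY . . step, so that step and every later one (including warmup) get new keys, while the earlier install steps keep theirs.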
Including the pyannote diarization model at build time
pyannote/speaker-diarization-3.1 is gated. Two approaches:
(a) Supply the token during build — the model is baked into the image:
docker build --target latest --build-arg HF_TOKEN=hf_xxx \
-f docker/Dockerfile -t voxkitchen:latest .
(b) Omit at build — pass HF_TOKEN at run time, pyannote downloads on
first use:
# Put HF_TOKEN=hf_xxx in ./.env, then run through the vkit wrapper.
vkit docker run --image voxkitchen:latest pipeline.yaml
You must accept the model agreement on HuggingFace with the same account as the token, or pyannote returns 403.
Do not bake a personal HF token into an image you then publish — it remains retrievable in the image layer history.
Checking what's inside an image
vkit docker doctor --image voxkitchen:latest # per-env table
vkit docker doctor --image voxkitchen:latest --json # machine-readable
doctor in the latest image auto-detects the five envs and runs each
one's self-check in a subprocess, producing one row per env plus a detail
list of any missing operators. In slim (or any single-env target) it
shows the single-env report.
During image build we also write /opt/voxkitchen/warmup_<env>.json for each
env; doctor reads these to report model-cache status.
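For a quick manual check outside of doctor, something like the following can flag missing or empty warmup reports. It makes no assumption about the JSON schema of warmup_<env>.json (which is not documented here) and only distinguishes missing, empty, unreadable, and non-empty files; the function name is hypothetical.

```python
import json
from pathlib import Path


def warmup_status(root: str, envs: list) -> dict:
    """Classify each env's warmup report as ok, empty, missing, or unreadable."""
    status = {}
    for env in envs:
        path = Path(root) / f"warmup_{env}.json"
        if not path.is_file():
            status[env] = "missing"
        elif path.stat().st_size == 0:
            status[env] = "empty"
        else:
            try:
                data = json.loads(path.read_text())
            except json.JSONDecodeError:
                status[env] = "unreadable"
            else:
                # An empty object or list counts as empty as well.
                status[env] = "ok" if data else "empty"
    return status
```

Inside a container this would be called with root="/opt/voxkitchen" and the env names from the targets table.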
CUDA version mismatch
All targets pin CUDA 12.4 via the pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime
base image. If your driver only supports an older CUDA, edit the first FROM
line in docker/Dockerfile to match — PyTorch publishes matching base images
at pytorch/pytorch:2.4.1-cuda{11.8,12.1}-cudnn9-runtime. You will also need
to change the --index-url https://download.pytorch.org/whl/cu124 lines in
the asr-env and tts-env stages to the matching CUDA version.
Keep torch==2.4.1 in docker/constraints/*.txt unless you intentionally want
a different torch minor — some extras pin specific torch versions and relaxing
this here will not help if they re-pin downstream.
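The two values that have to change together can be derived from the CUDA version. The helper below is a sketch with a hypothetical name; it covers only the CUDA versions named in this section and follows the tag and index-url patterns quoted above.

```python
SUPPORTED_CUDA = ("11.8", "12.1", "12.4")  # versions mentioned in this section


def cuda_build_settings(cuda: str, torch: str = "2.4.1") -> dict:
    """Return the base image tag and PyTorch wheel index for a CUDA version."""
    if cuda not in SUPPORTED_CUDA:
        raise ValueError(f"unsupported CUDA version: {cuda}")
    return {
        "base_image": f"pytorch/pytorch:{torch}-cuda{cuda}-cudnn9-runtime",
        # cu118 / cu121 / cu124: the version with the dot removed.
        "index_url": f"https://download.pytorch.org/whl/cu{cuda.replace('.', '')}",
    }
```

Editing the FROM line without also switching the wheel index would leave the venvs on cu124 wheels, so it helps to treat the two values as a pair.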
Adding a new extras group
- Add the group to pyproject.toml under [project.optional-dependencies].
- Decide which env should ship it. Update EXTRA_TO_ENV in voxkitchen/runtime/env_resolver.py; this is the source of truth for "which env runs ops with this extra".
- Add the extras name to the pip install -e ".[...]" line in the matching stage of docker/Dockerfile (core-env / asr-env / tts-env).
- Add the operators it enables to EXPECTED_OPERATORS in voxkitchen/cli/doctor.py under the same env key, so the build-time smoke test catches regressions.
- If the new extras introduces a shared dep that conflicts with existing pins, update docker/constraints/{core,asr,tts}.txt to a mutually compatible version. If no compatible version exists, the new extras belongs in a new env (expand the design, don't force-merge).
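A small consistency check can catch the most common slip in the steps above: mapping an extras group to an env that has no expected operators listed. The sketch below assumes EXTRA_TO_ENV maps extras name to env name and EXPECTED_OPERATORS maps env name to a list of operator names; the real structures in env_resolver.py and doctor.py may be shaped differently, and example_asr_op is a placeholder name.

```python
def envs_without_expectations(extra_to_env: dict, expected_operators: dict) -> list:
    """Envs that ship at least one extras group but list no expected operators."""
    used_envs = set(extra_to_env.values())
    return sorted(env for env in used_envs if not expected_operators.get(env))


# Illustrative data only; the shapes and most names here are assumptions.
EXTRA_TO_ENV = {"classify": "core", "asr": "asr", "funasr": "asr"}
EXPECTED_OPERATORS = {"core": ["gender_classify"], "asr": ["example_asr_op"]}

problems = envs_without_expectations(EXTRA_TO_ENV, EXPECTED_OPERATORS)
```

An empty result means every env that receives an extras group is also covered by the build-time smoke test.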
The gender extras gotcha
inaSpeechSegmenter pins tensorflow[and-cuda] + onnxruntime-gpu which
won't install alongside either the asr or the tts stack. The gender extras
group is therefore not installed in any image. The gender_classify operator
is still registered and works with method=f0 (pitch-based, no model) or
method=speechbrain (uses the classify extras, which IS installed in core).
Users who truly need method=inaspeechsegmenter can install it at runtime in
a throwaway container: pip install inaSpeechSegmenter inside the running
image and run again.
Debugging a failed build
- Run docker build --target slim first: it's smaller and catches basic apt / pyproject / uv issues before you spend time on GPU layers.
- If slim passes but latest fails in asr-env or tts-env, the issue is almost always a transitive dep conflict in that env. Re-run that stage manually:
docker build --target core-env -f docker/Dockerfile -t vk:debug .
docker run --rm -it vk:debug bash
# inside:
uv pip install --python /opt/voxkitchen/envs/core/bin/python -e ".[asr,funasr]"
- warmup_*.json being empty / missing in a built image means warmup failed silently (we use || true so missing models don't fail the build). vkit doctor surfaces which models didn't download.
What this doc doesn't cover
- Publishing to GHCR: that lives in RELEASING.md and is driven by scripts/release.sh. The release script is also what wires up the GHCR layer cache described above.
- Multi-arch builds (arm64). The buildx command is straightforward; the longer story is validating every extras group on arm64 wheels, which we haven't done.