Multi-env architecture
Status: implemented. This page records the design rationale and the current runtime shape.
Why
A single Python environment cannot host all 52 operators. Concrete conflicts verified against wheel metadata:
- pyannote-audio>=4.0 ⇒ torch>=2.8 and numpy>=2.1
- funasr + modelscope ⇒ effectively capped at the transformers<5 / numpy<2 stack
- inaSpeechSegmenter ⇒ tensorflow[and-cuda] + onnxruntime-gpu
- ChatTTS / CosyVoice / Fish-Speech have transitive pins that fight with the ASR stack
Any "one image, all operators" approach either (a) silently downgrades deps with
pip install || echo WARN — producing an image that looks healthy but breaks
at runtime — or (b) refuses to resolve at all.
Core idea
VoxKitchen already checkpoints each stage to disk as cuts.jsonl.gz. Stages do not
share memory. Therefore each stage can run in its own Python interpreter with its
own installed packages, provided we route inputs and outputs via the existing disk
checkpoints.
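To make the disk boundary concrete, here is a minimal sketch of one stage handing records to the next through a gzipped JSONL file. It uses only the standard library. The real pipeline goes through CutSet and Pydantic models, so the helper names write_cuts and read_cuts and the record fields below are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator


def write_cuts(path: Path, cuts: Iterable[dict]) -> None:
    """Write one JSON object per line into a gzip-compressed file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for cut in cuts:
            fh.write(json.dumps(cut, ensure_ascii=False) + "\n")


def read_cuts(path: Path) -> Iterator[dict]:
    """Yield records back in the order they were written."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate a trailing blank line
                yield json.loads(line)
```

Because the file holds only plain JSON, the writer and the reader can live in interpreters with entirely different package sets.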
Shape
One Docker image can contain multiple Python environments inside it, created
with uv. A thin dispatch layer in the pipeline runner decides which env each
stage runs in.
/opt/voxkitchen/
  envs/
    core/          # CPU torch 2.4; audio/segment/quality/pack/pitch/dnsmos/classify/enhance/codec/viz
    asr/           # GPU torch 2.4; asr/whisper/funasr/align (no diarize)
    diarize/       # GPU torch 2.4; pyannote 3.x only; separate from asr so :diarize can ship small
    tts/           # GPU torch 2.4; tts-kokoro/chattts/cosyvoice
    fish-speech/   # GPU torch 2.8; tts-fish-speech (upstream pins torch 2.8 / numpy 2.1)
  op_schemas.json  # every op's pydantic schema, keyed by op name
  op_env_map.json  # op name → env name
  model_cache/     # shared HF / torch / modelscope cache across envs
Why fish-speech is its own env: its torch==2.8.0 pin is incompatible with
ChatTTS / CosyVoice / kokoro on torch 2.4. Forcing them onto torch 2.8 would
expand the risk surface from "one broken TTS" to "all TTS broken". Isolating
fish-speech costs substantial image size and one extra venv to warm; in
return every other env stays on its validated stack.
Fish-Speech uses the S2 inference API
(fish_speech.inference_engine.TTSInferenceEngine) with a queue-based Llama
generator and DAC decoder. VoxKitchen wires that stack inside the isolated
fish-speech env, and latest includes that env for mixed-runtime pipelines.
The vkit command on $PATH is a shim that routes into envs/core/bin/python. The
core env is the parent: it loads the pipeline YAML, decides per-stage envs, and
dispatches.
Data flow
vkit docker run pipeline.yaml
        │
        ▼  (container entrypoint)
vkit run pipeline.yaml
        │
        ▼  (core env, parent process)
load spec → validate args against op_schemas.json
        │
        ▼
for each stage:
    target_env = resolve_env(stage.op)
    if target_env == "core":
        run in-process (existing CpuPoolExecutor / GpuPoolExecutor)
    else:
        write input cuts.jsonl.gz (already present from prior stage)
        spawn: /opt/voxkitchen/envs/<target_env>/bin/python \
                 -m voxkitchen.runtime.stage_runner \
                 --op <name> --config-json <json> \
                 --input <prev/cuts.jsonl.gz> \
                 --output <this/cuts.jsonl.gz> \
                 --ctx-json <ctx>
        wait, check exit code, surface stderr on failure
    // next stage reads <this/cuts.jsonl.gz> from disk, same as today
The disk-based stage boundary already exists; subprocess dispatch is additive.
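The spawn step in the flow above can be sketched as a pure function that assembles the argument vector. This is an illustration, not the shipped dispatcher: build_stage_command is a hypothetical helper, and the op name and config in the usage note are made up. Only the interpreter path layout and the flag names come from this page.

```python
import json
from pathlib import PurePosixPath
from typing import Any, Dict, List

# Env layout taken from the "Shape" section; paths are POSIX inside the image.
ENVS_ROOT = PurePosixPath("/opt/voxkitchen/envs")


def build_stage_command(
    target_env: str,
    op: str,
    config: Dict[str, Any],
    input_path: str,
    output_path: str,
    ctx: Dict[str, Any],
    envs_root: PurePosixPath = ENVS_ROOT,
) -> List[str]:
    """Return the argv that runs one stage inside the target env's interpreter."""
    python = envs_root / target_env / "bin" / "python"
    return [
        str(python), "-m", "voxkitchen.runtime.stage_runner",
        "--op", op,
        "--config-json", json.dumps(config),  # config crosses as a JSON string
        "--input", input_path,
        "--output", output_path,
        "--ctx-json", json.dumps(ctx),
    ]
```

For example, build_stage_command("asr", "whisper", {"model": "small"}, "s1/cuts.jsonl.gz", "s2/cuts.jsonl.gz", {}) starts with /opt/voxkitchen/envs/asr/bin/python. Passing the result as a list to subprocess avoids shell quoting issues with the JSON arguments.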
Components
voxkitchen/runtime/env_resolver.py
Resolves an operator name to an env name. Does NOT import the operator class — only reads two small JSON files so the parent (core env) can decide dispatch for operators it cannot import.
def resolve_env(op_name: str) -> str: ...
def current_env() -> str: ... # reads $VKIT_ENV, set by each venv's bin/activate
Lookup order:
1. $VKIT_OP_ENV_MAP (override for tests) →
2. /opt/voxkitchen/op_env_map.json (docker) →
3. In-process fallback: walk registered operators, derive from required_extras
The fallback matters for source-tree development and unit tests where there is no prebuilt image map. It is not the supported user pipeline execution path.
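The lookup order can be sketched as follows. This is a hedged illustration rather than the actual module: it assumes the first available source wins outright (no per-operator fall-through), that $VKIT_OP_ENV_MAP holds a path to a JSON file, and that the in-process fallback is injected as a callable. The fallback and docker_map parameters are hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Callable, Dict, Optional

DOCKER_MAP = Path("/opt/voxkitchen/op_env_map.json")


def _load(path: Path) -> Dict[str, str]:
    return json.loads(path.read_text(encoding="utf-8"))


def resolve_env(
    op_name: str,
    fallback: Optional[Callable[[], Dict[str, str]]] = None,
    docker_map: Path = DOCKER_MAP,
) -> str:
    """Map an operator name to an env name without importing the operator."""
    override = os.environ.get("VKIT_OP_ENV_MAP")
    if override:                      # 1. explicit override, used by tests
        mapping = _load(Path(override))
    elif docker_map.is_file():        # 2. map baked into the image
        mapping = _load(docker_map)
    elif fallback is not None:        # 3. derived from registered operators
        mapping = fallback()
    else:
        mapping = {}
    if op_name not in mapping:
        raise KeyError(f"no env mapping for operator {op_name!r}")
    return mapping[op_name]
```

Keeping the function free of operator imports is what lets the core env decide dispatch for operators whose dependencies it does not have installed.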
voxkitchen/runtime/stage_runner.py
Subprocess entry point. Self-contained, runs inside any env that has the operator's deps installed.
python -m voxkitchen.runtime.stage_runner \
--op <name> \
--config-json <json-string> \
--input <cuts.jsonl.gz> \
--output <cuts.jsonl.gz> \
--ctx-json <json-string>
Behavior:
1. Import voxkitchen.operators (populates registry with whatever this env can load)
2. Read input cuts with CutSet.from_jsonl_gz
3. Resolve op, validate config, pick executor (CPU pool or GPU pool)
4. Run, write output, write _errors.jsonl and _stats.json
5. Exit 0 on success, non-zero on unrecoverable failure
The parent treats this process as a black box: same pipeline, just remote.
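A minimal sketch of the entry point's argument handling and exit-code contract is shown below. The flag names match the command above. Everything else is an assumption for illustration: the run_stage callable stands in for steps 1 to 4 and is injected so the sketch stays self-contained.

```python
import argparse
import json
import sys
from pathlib import Path
from typing import Callable, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="voxkitchen.runtime.stage_runner")
    parser.add_argument("--op", required=True)
    # JSON arguments are decoded eagerly so malformed input fails before any work starts.
    parser.add_argument("--config-json", required=True, type=json.loads)
    parser.add_argument("--input", required=True, type=Path)
    parser.add_argument("--output", required=True, type=Path)
    parser.add_argument("--ctx-json", required=True, type=json.loads)
    return parser.parse_args(argv)


def main(
    argv: Optional[Sequence[str]],
    run_stage: Callable[[argparse.Namespace], None],
) -> int:
    """Return 0 on success and 1 on an unrecoverable failure."""
    args = parse_args(argv)
    try:
        run_stage(args)  # hypothetical: load cuts, resolve op, run, write outputs
    except Exception as exc:
        print(f"stage {args.op!r} failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

The parent only needs the integer returned by main (passed to sys.exit) and whatever was written to stderr.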
voxkitchen/runtime/dump_schemas.py
Run once per env at image build time. Walks the registered operators and emits a
JSON object {op_name: {schema, required_extras, device}}. Output is merged across
envs in the Dockerfile.
voxkitchen/runtime/merge_schemas.py
Combines per-env dumps into the final op_schemas.json and op_env_map.json.
Detects when the same operator is registered in multiple envs (which should not
happen after this refactor — a symptom of an incorrect EXTRA_TO_ENV mapping).
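The merge step might look roughly like the sketch below. It assumes each per-env dump has already been filtered to the operators that env owns, and it treats a duplicate registration as a hard error; whether the real tool raises or only warns is not stated here. The function name and signature are hypothetical.

```python
from typing import Dict, Mapping, Tuple


def merge_schemas(
    dumps: Mapping[str, Mapping[str, dict]],
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Combine {env: {op_name: entry}} dumps into (op_schemas, op_env_map)."""
    op_schemas: Dict[str, dict] = {}
    op_env_map: Dict[str, str] = {}
    for env_name, dump in dumps.items():
        for op_name, entry in dump.items():
            if op_name in op_env_map:
                # Same operator claimed by two envs: likely a bad EXTRA_TO_ENV mapping.
                raise ValueError(
                    f"operator {op_name!r} registered in both "
                    f"{op_env_map[op_name]!r} and {env_name!r}"
                )
            op_schemas[op_name] = entry
            op_env_map[op_name] = env_name
    return op_schemas, op_env_map
```

Failing the image build on a duplicate keeps an ambiguous op-to-env mapping from ever reaching runtime.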
voxkitchen/pipeline/runner.py
The runner is env-aware at the stage boundary:
- Resolve target_env = resolve_env(stage.op).
- If target_env == current_env(), run the operator in process with CpuPoolExecutor or GpuPoolExecutor.
- Otherwise, write the stage input manifest to disk and call dispatch_stage_to_env(...), which spawns voxkitchen.runtime.stage_runner in the target env.
stage_runner then re-runs the normal in-process executor path inside that
env, so CPU sharding, GPU pinning, progress bars, _errors.jsonl, and
_stats.json behave the same as local stages.
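The "wait, check exit code, surface stderr" part of cross-env dispatch can be sketched with the standard subprocess module. This is not the body of dispatch_stage_to_env; the names run_stage_subprocess and StageFailed are hypothetical. Note that capturing stderr, as done here for simplicity, would hide live progress bars, so a real implementation may stream it instead.

```python
import subprocess
from typing import Sequence


class StageFailed(RuntimeError):
    """Raised when a stage subprocess exits with a non-zero code."""


def run_stage_subprocess(cmd: Sequence[str]) -> None:
    """Run a prepared stage command and turn a failure into an exception."""
    proc = subprocess.run(
        list(cmd),
        stderr=subprocess.PIPE,  # keep stderr so it can be shown to the user
        text=True,
        check=False,             # inspect the return code ourselves
    )
    if proc.returncode != 0:
        raise StageFailed(
            f"stage exited with code {proc.returncode}\n{proc.stderr}"
        )
```

On success the function returns nothing: the only artifact the parent cares about is the output manifest the child wrote to disk.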
Environment construction
Tool: uv
- ~10× faster installs than pip
- Lockfile support (uv.lock), committed to repo for reproducibility
- Supports PyTorch CUDA wheels correctly
- Astral Inc. maintained, stable
Per-env constraints
docker/constraints/{core,asr,tts}.txt pin shared deps. Constraints are stricter per
env than before because we no longer need one set to cover all operators — each env
can pin to exactly what its extras agree on:
- core: torch==2.4.1+cpu, numpy<2.0, transformers>=4.40,<5.0
- asr: torch==2.4.1, numpy<2.0, transformers>=4.40,<5.0, huggingface_hub>=0.23,<1.0, ctranslate2>=4.4,<5.0
- tts: torch==2.4.1, numpy<2.0, transformers>=4.40,<5.0 (modelscope may add more pins; TBD on first build)
Lockfiles
Each env maintains its own uv.lock:
- docker/uv.lock.core
- docker/uv.lock.asr
- docker/uv.lock.tts
docker build uses --frozen to fail the build if lockfiles drift from pyproject.toml.
CI regenerates lockfiles on a schedule (weekly) and opens a PR — keeps them fresh without surprising builds.
Dockerfile
One Dockerfile with six BuildKit targets:
- target=slim: core env only, torch-cpu, ~13 GB
- target=asr: core + asr env, ~48 GB
- target=diarize: core + diarize env (pyannote only), ~32 GB
- target=tts: core + tts env, ~44 GB
- target=fish-speech: core + fish-speech env (isolated torch 2.8, S2 cached), ~57 GB
- target=latest: all five envs merged, ~123 GB
FROM pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime AS base
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential python3-dev ffmpeg libsndfile1 espeak-ng sox \
&& rm -rf /var/lib/apt/lists/*
FROM base AS core-env
RUN uv venv /opt/voxkitchen/envs/core --python 3.11 && \
uv pip install --python /opt/voxkitchen/envs/core/bin/python \
-c docker/constraints/core.txt \
-e ".[audio,segment,quality,pack,pitch,dnsmos,classify,enhance,codec,viz]"
# warmup + schema dump for core
RUN /opt/voxkitchen/envs/core/bin/python scripts/warmup_models.py --group core
RUN /opt/voxkitchen/envs/core/bin/python -m voxkitchen.runtime.dump_schemas \
--env core --out /tmp/schemas_core.json
FROM core-env AS slim
# Just the vkit shim → core env, plus a minimal op_env_map.json & op_schemas.json
RUN /opt/voxkitchen/envs/core/bin/python -m voxkitchen.runtime.merge_schemas \
/tmp/schemas_core.json --out /opt/voxkitchen/op_schemas.json
COPY docker/vkit-shim.sh /usr/local/bin/vkit
ENTRYPOINT ["vkit"]
FROM core-env AS asr-env
RUN uv venv /opt/voxkitchen/envs/asr --python 3.11 && \
uv pip install --python /opt/voxkitchen/envs/asr/bin/python \
-c docker/constraints/asr.txt \
-e ".[audio,segment,quality,pack,pitch,dnsmos,classify,enhance,codec,viz,asr,whisper,funasr,align]"
RUN /opt/voxkitchen/envs/asr/bin/python scripts/warmup_models.py --group asr
RUN /opt/voxkitchen/envs/asr/bin/python -m voxkitchen.runtime.dump_schemas \
--env asr --out /tmp/schemas_asr.json
FROM asr-env AS tts-env
RUN uv venv /opt/voxkitchen/envs/tts --python 3.11 && \
uv pip install --python /opt/voxkitchen/envs/tts/bin/python \
-c docker/constraints/tts.txt \
-e ".[audio,segment,quality,pack,pitch,dnsmos,classify,enhance,codec,viz,tts-kokoro,tts-chattts,tts-cosyvoice]"
RUN /opt/voxkitchen/envs/tts/bin/python scripts/warmup_models.py --group tts
RUN /opt/voxkitchen/envs/tts/bin/python -m voxkitchen.runtime.dump_schemas \
--env tts --out /tmp/schemas_tts.json
FROM tts-env AS latest
RUN /opt/voxkitchen/envs/core/bin/python -m voxkitchen.runtime.merge_schemas \
/tmp/schemas_core.json /tmp/schemas_asr.json /tmp/schemas_tts.json \
--out /opt/voxkitchen/op_schemas.json
RUN /opt/voxkitchen/envs/core/bin/vkit doctor --expect-all
COPY docker/vkit-shim.sh /usr/local/bin/vkit
ENTRYPOINT ["vkit"]
Build commands:
docker build --target latest -t voxkitchen:latest .
docker build --target slim -t voxkitchen:slim .
CLI surface
The user-facing path stays vkit docker .... The main env-aware command is
vkit doctor --expect <env>, and the latest image can aggregate doctor
results across installed envs. vkit validate uses exported schemas when the
current env cannot import an operator directly.
Risks and mitigations
| Risk | Mitigation |
|---|---|
| Subprocess spawn latency (~0.5s × N stages) | Only crossed when the op's env ≠ current. Core-only pipelines never spawn. |
| Error propagation across subprocess boundary | stage_runner writes _errors.jsonl and _stats.json as today. Exit code and stderr pass through dispatch_stage_to_env. |
| Cross-env pickle mismatch | We do NOT pickle across envs. Everything crosses via jsonl.gz on disk (Pydantic v2 serialization is env-agnostic). |
| Operator registration differs per env | Each env runs its own dump_schemas.py. Parent trusts the merged op_env_map.json — it doesn't need to import. |
| User adds a custom operator via plugin entry_points | The plugin's env must be declared. We extend op_env_map.json to accept operator→env overrides from $VKIT_EXTRA_OP_ENV_MAP. |
| Lockfile drift | CI regenerates weekly + --frozen in Dockerfile. |
| GPU memory not released between stages | Subprocess exits between stages, so this is automatic. Better than current behavior within a single process. |
| Resume across env switches | Checkpoint format is unchanged (cuts.jsonl.gz + _SUCCESS marker). Resume reads the file; it doesn't know or care which env produced it. |
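
The resume row can be illustrated with a short sketch. It assumes each stage owns a directory holding cuts.jsonl.gz and a _SUCCESS marker, as named in the table; the function names and the one-directory-per-stage layout are hypothetical.

```python
from pathlib import Path
from typing import Sequence


def stage_is_complete(stage_dir: Path) -> bool:
    """A stage counts as done only if both the manifest and the marker exist."""
    return (stage_dir / "_SUCCESS").is_file() and (stage_dir / "cuts.jsonl.gz").is_file()


def first_stage_to_run(stage_dirs: Sequence[Path]) -> int:
    """Index of the first incomplete stage; len(stage_dirs) if all are done."""
    for index, stage_dir in enumerate(stage_dirs):
        if not stage_is_complete(stage_dir):
            return index
    return len(stage_dirs)
```

Nothing in the check refers to an env, which is why resume behaves the same whether a stage ran in process or in a subprocess.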
Implemented pieces
- Runtime modules: env_resolver, dispatch, stage_runner, dump_schemas, merge_schemas.
- Schema-driven validation for operators that are not importable in the current env.
- Unified docker/Dockerfile with BuildKit targets for slim, asr, diarize, tts, fish-speech, and latest.
- vkit doctor --expect <env> plus multi-env aggregation in latest.
- End-to-end tests for local and cross-env stage execution.
What does NOT change
- Operator authoring contract: subclass Operator, declare name, config_cls, required_extras.
- YAML surface is unchanged. Host users run it with vkit docker run; the image entrypoint still calls vkit run internally.
- Checkpoint / resume semantics.
- GC / trash behavior.
- Tests: existing operator tests run in the env that has their deps. Parent-env smoke tests run in core.