Pipeline YAML Reference
A VoxKitchen pipeline is defined as a YAML file with three sections: metadata, ingest, and stages. If you are choosing a starting point for a task, start with Examples & Use Cases. Use this page when you want to edit, extend, or write a pipeline YAML file directly.
Full Schema
version: "0.1" # Required. Schema version.
name: my-pipeline # Required. Pipeline name.
description: "What this pipeline does" # Optional.
work_dir: ./work/${name}-${run_id} # Required. Output directory.
num_gpus: 1 # Optional. GPU count (default: 1).
num_cpu_workers: null # Optional. CPU workers (default: auto).
gc_mode: aggressive # Optional. "aggressive" or "keep".
ingest: # Required. How data enters the pipeline.
  source: dir | manifest | recipe # Required. Ingest source type.
  recipe: librispeech # For source=recipe only.
  args: # Source-specific arguments.
    root: /path/to/data
    recursive: true
stages: # Required. Processing steps.
  - name: my_stage # Required. Unique stage name.
    op: operator_name # Required. Registered operator name.
    args: # Optional. Operator-specific config.
      param1: value1
      param2: value2
String Interpolation
Pipeline YAML supports variable substitution:
| Variable | Value |
|---|---|
| ${name} | Pipeline name. |
| ${run_id} | Generated run ID (e.g., run-20260415-a1b2c3). |
| ${env:VAR} | Value of environment variable VAR. Raises only if VAR is unset; an empty-string VAR passes through. |
| ${env:VAR:-default} | Value of VAR if set and non-empty, otherwise the literal default. |
| ${env:VAR:?msg} | Value of VAR if set and non-empty, otherwise raises with msg. |
The three env: forms differ in how they treat an empty-string value of
VAR, matching the corresponding POSIX shell parameter expansions:
- ${env:VAR} accepts empty strings (only unset raises). Use this when an empty value is a meaningful configuration.
- ${env:VAR:-default} and ${env:VAR:?msg} treat unset and empty identically. Prefer these when you want "missing or blank" to be one case.
- default may itself be empty: ${env:VAR:-} renders to the empty string when VAR is unset or blank.
work_dir: ./work/${name}-${run_id} # → ./work/my-pipeline-run-20260415-a1b2c3
num_cpu_workers: ${env:WORKERS:-8} # 8 unless WORKERS is set to a non-empty value
# pyannote_diarize wants a HuggingFace token; surface a clear error if missing.
stages:
  - name: diarize
    op: pyannote_diarize
    args:
      hf_token: ${env:HF_TOKEN:?set HF_TOKEN in ./.env}
A } cannot appear inside a default or error message — the parser stops at
the first } character.
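The rules for the three env: forms can be sketched as a small resolver. This is an illustrative sketch only, not VoxKitchen's actual parser: the function name interpolate_env, the regular expression, and the exception types are assumptions made for this example. It shows the empty-string handling described above, and why a } inside a default or message would end the match early.

```python
import os
import re

# Matches ${env:VAR}, ${env:VAR:-default}, and ${env:VAR:?msg}.
# [^}]* stops at the first "}", so a default or message cannot contain one.
_ENV_PATTERN = re.compile(
    r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)(?:(:-|:\?)([^}]*))?\}"
)


def interpolate_env(text, environ=None):
    """Expand the three env: forms in text (hypothetical helper)."""
    env = os.environ if environ is None else environ

    def expand(match):
        var, op, arg = match.group(1), match.group(2), match.group(3)
        value = env.get(var)
        if op is None:
            # Plain form: only an unset variable is an error.
            if value is None:
                raise KeyError(f"environment variable {var} is not set")
            return value
        if value:
            # Set and non-empty: both operator forms return the value.
            return value
        if op == ":-":
            return arg  # the default, which may be the empty string
        # ":?" form: unset and empty are treated the same way.
        raise ValueError(f"{var}: {arg}")

    return _ENV_PATTERN.sub(expand, text)


env = {"HF_TOKEN": "", "WORKERS": "4"}
print(interpolate_env("workers=${env:WORKERS:-8}", env))  # workers=4
print(repr(interpolate_env("${env:HF_TOKEN}", env)))      # ''
```

Passing an explicit dictionary instead of os.environ keeps the example deterministic; with environ left as None the helper reads the real process environment.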
Ingest Sources
dir — Scan a directory for audio files
ingest:
  source: dir
  args:
    root: /path/to/audio # Required.
    recursive: true # Optional (default: true).
manifest — Load a pre-built CutSet
ingest:
  source: manifest
  args:
    path: /path/to/cuts.jsonl.gz # Required.
recipe — Use a dataset recipe
ingest:
  source: recipe
  recipe: librispeech # Recipe name.
  args:
    root: /path/to/librispeech # Required.
    subsets: [train-clean-100] # Optional. Default: all subsets.
Available recipes: librispeech, aishell, commonvoice, fleurs
Pipeline Execution
Resume
Pipelines checkpoint after each stage. If a run crashes, re-running resumes from the last completed stage:
vkit docker run pipeline.yaml # Auto-resume
vkit docker run pipeline.yaml --resume-from asr # Force resume from specific stage
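One way to picture auto-resume is as a scan for the first stage that has no completion marker. The sketch below is a guess at the idea, not VoxKitchen's implementation: the function first_incomplete_stage is hypothetical, and it assumes one subdirectory per stage inside work_dir, named after the stage and containing a _SUCCESS marker once the stage finishes.

```python
import tempfile
from pathlib import Path


def first_incomplete_stage(work_dir, stage_names):
    """Return the first stage without a _SUCCESS marker, or None if all are done.

    Assumption: each stage writes into work_dir/<stage name>/.
    """
    for name in stage_names:
        if not (Path(work_dir) / name / "_SUCCESS").exists():
            return name
    return None


with tempfile.TemporaryDirectory() as work_dir:
    # Simulate a run where "vad" finished and "asr" crashed before completing.
    (Path(work_dir) / "vad").mkdir()
    (Path(work_dir) / "vad" / "_SUCCESS").touch()
    (Path(work_dir) / "asr").mkdir()
    resume_stage = first_incomplete_stage(work_dir, ["vad", "asr", "export"])

print(resume_stage)  # asr
```

Under this model, --resume-from would simply override the scanned value with the stage named on the command line.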
Partial Execution
vkit docker run pipeline.yaml --stop-at vad # Stop after VAD stage
Garbage Collection
By default (gc_mode: aggressive), intermediate audio files are cleaned up after downstream stages finish. Use --keep-intermediates to preserve all derived audio.
Pre-flight Validation
A static pre-flight check runs over the stage chain before any data is
processed. It runs in vkit validate, in vkit run and vkit docker run (as a
fail-fast gate ahead of the executor), and in --dry-run mode.
Pre-flight seeds an available-field set from the ingest source, then walks
each stage in order, consulting each operator's field contract. The seed
depends on the source:
- dir starts with just audio (plus custom.reference_text when a reference_text_glob is set).
- recipe adds supervision fields.
- manifest additionally assumes the metrics.* and custom.* namespaces may be present, since it loads a previously built CutSet.
| Outcome | Trigger | Effect |
|---|---|---|
| ERROR | A stage's reads (or dynamic_reads) reference a field no upstream stage produces. | Printed with stage name and missing field; exits with code 1. |
| WARNING | A stage's optional_reads field is absent from the available set. | Printed; pipeline is not blocked (the stage will skip or degrade gracefully). |
After each stage, its writes tokens are added to the available set and its
clears tokens are removed.
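The walk described above can be sketched in a few lines. This is a simplified model rather than the real checker: FieldContract and preflight are hypothetical names, fields are matched by exact string (so the namespace assumptions made for manifest sources and dynamic_reads are not modeled), and the contracts given to each stage are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FieldContract:
    """Hypothetical stand-in for an operator's field contract."""
    reads: set = field(default_factory=set)
    optional_reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    clears: set = field(default_factory=set)


def preflight(seed_fields, stages):
    """Walk (stage_name, contract) pairs in order; return (errors, warnings)."""
    available = set(seed_fields)
    errors, warnings = [], []
    for name, contract in stages:
        for missing in sorted(contract.reads - available):
            errors.append(
                f"stage '{name}' requires '{missing}' "
                "but no upstream stage produces it"
            )
        for missing in sorted(contract.optional_reads - available):
            warnings.append(f"stage '{name}' optionally reads '{missing}'")
        # After the stage, its writes become available and its clears are removed.
        available |= contract.writes
        available -= contract.clears
    return errors, warnings


# Made-up contracts: a filter that reads a metric nothing has written yet.
stages = [
    ("vad", FieldContract(reads={"audio"}, writes={"supervisions"})),
    ("filter", FieldContract(reads={"metrics.snr"})),
]
errors, warnings = preflight({"audio"}, stages)
print(errors)
```

Because the available set only changes after a stage has been checked, a stage cannot satisfy its own reads; the producing stage has to come earlier in the list.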
Example error — filtering on a metric nothing produces:
stages:
  - name: vad
    op: silero_vad
  - name: filter
    op: quality_score_filter
    args:
      conditions: ["metrics.snr > 20"] # ERROR: metrics.snr never written
Pre-flight reports:
error: stage 'filter' (op 'quality_score_filter') requires 'metrics.snr' but no upstream stage produces it
Fix: add the producing stage before the filter:
stages:
  - name: vad
    op: silero_vad
  - name: snr
    op: snr_estimate # writes metrics.snr
  - name: filter
    op: quality_score_filter
    args:
      conditions: ["metrics.snr > 20"] # OK
Skipping pre-flight: pass --no-preflight to vkit validate,
vkit run --dry-run, or vkit docker run if you need to bypass the check
(e.g., while iterating on a partial pipeline). vkit docker run --no-preflight
forwards the flag to the in-container run.
Pre-flight uses the live operator registry where possible, falling back to
op_schemas.json (bundled inside Docker images) for operators not importable
in the current environment.
Stage Execution
Stages execute sequentially. Each stage:
- Receives the CutSet from the previous stage
- Splits it across CPU/GPU workers when the operator is shard-safe
- Runs the operator on each shard
- Merges results and writes cuts.jsonl.gz, a _SUCCESS marker, and _stats.json
Failed cuts in shard-safe stages are logged to _errors.jsonl and skipped, so
the pipeline continues. Batch exporters such as pack_huggingface run once
over the whole CutSet and fail atomically to avoid partial output directories.
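The shard, run, and merge steps for a shard-safe stage, together with the skip-on-failure behavior, might look roughly like the sketch below. It is a toy model under stated assumptions: cuts are plain dictionaries with an id key rather than real CutSet entries, shards are processed one after another instead of on separate workers, and the function run_shard_safe_stage and the error record layout are invented for this example.

```python
import json


def run_shard_safe_stage(cuts, operator, num_workers):
    """Shard, run, and merge; failed cuts are recorded and skipped."""
    # Round-robin split; a real executor would hand each shard to a worker.
    shards = [cuts[i::num_workers] for i in range(num_workers)]
    merged, error_lines = [], []
    for shard in shards:
        for cut in shard:
            try:
                merged.append(operator(cut))
            except Exception as exc:
                # One JSON line per failed cut, in the spirit of _errors.jsonl.
                error_lines.append(json.dumps({"id": cut["id"], "error": str(exc)}))
    return merged, error_lines


def require_audio(cut):
    """Toy operator: rejects cuts with no audio, tags the rest."""
    if cut["duration"] <= 0:
        raise ValueError("empty audio")
    return {**cut, "checked": True}


cuts = [
    {"id": "a", "duration": 1.0},
    {"id": "b", "duration": 0.0},
    {"id": "c", "duration": 2.5},
]
merged, error_lines = run_shard_safe_stage(cuts, require_audio, num_workers=2)
print([cut["id"] for cut in merged])  # ['a', 'c']
print(len(error_lines))               # 1
```

A batch exporter would skip the try/except around individual cuts and let the first exception abort the whole stage, which is what makes its failure atomic.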