
Pipeline YAML Reference

A VoxKitchen pipeline is defined as a YAML file with three sections: metadata, ingest, and stages. If you are choosing a starting point for a task, start with Examples & Use Cases. Use this page when you want to edit, extend, or write a pipeline YAML file directly.

Full Schema

version: "0.1"                          # Required. Schema version.
name: my-pipeline                       # Required. Pipeline name.
description: "What this pipeline does"  # Optional.

work_dir: ./work/${name}-${run_id}      # Required. Output directory.
num_gpus: 1                             # Optional. GPU count (default: 1).
num_cpu_workers: null                   # Optional. CPU workers (default: auto).
gc_mode: aggressive                     # Optional. "aggressive" (default) or "keep".

ingest:                                 # Required. How data enters the pipeline.
  source: dir | manifest | recipe       # Required. Ingest source type.
  recipe: librispeech                   # For source=recipe only.
  args:                                 # Source-specific arguments.
    root: /path/to/data
    recursive: true

stages:                                 # Required. Processing steps.
  - name: my_stage                      # Required. Unique stage name.
    op: operator_name                   # Required. Registered operator name.
    args:                               # Optional. Operator-specific config.
      param1: value1
      param2: value2
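
As a rough illustration of the Required/Optional annotations above, the sketch below checks a parsed pipeline document in Python. It is not VoxKitchen's validator: check_pipeline and the message wording are hypothetical, and the input is assumed to be the YAML already loaded into a plain dict.

```python
# Hypothetical helper, not part of VoxKitchen: mirrors the "Required"
# annotations in the schema above for an already-parsed YAML document.
REQUIRED_TOP_LEVEL = ("version", "name", "work_dir", "ingest", "stages")
REQUIRED_PER_STAGE = ("name", "op")


def check_pipeline(doc):
    """Return a list of human-readable problems (an empty list means OK)."""
    problems = [f"missing required key: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in doc]

    seen = set()
    for index, stage in enumerate(doc.get("stages") or []):
        for key in REQUIRED_PER_STAGE:
            if key not in stage:
                problems.append(f"stages[{index}]: missing required key: {key}")
        name = stage.get("name")
        if name in seen:
            # The schema asks for a unique name per stage.
            problems.append(f"stages[{index}]: duplicate stage name: {name}")
        if name is not None:
            seen.add(name)

    gc_mode = doc.get("gc_mode", "aggressive")
    if gc_mode not in ("aggressive", "keep"):
        problems.append(f"gc_mode must be 'aggressive' or 'keep', got {gc_mode!r}")
    return problems
```

An empty result means the document has every required key, unique stage names, and a recognised gc_mode; it says nothing about whether the operators or their args are valid.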

String Interpolation

Pipeline YAML supports variable substitution:

| Variable | Value |
|---|---|
| ${name} | Pipeline name. |
| ${run_id} | Generated run ID (e.g., run-20260415-a1b2c3). |
| ${env:VAR} | Value of environment variable VAR. Raises only if VAR is unset; an empty-string VAR passes through. |
| ${env:VAR:-default} | Value of VAR if set and non-empty, otherwise the literal default. |
| ${env:VAR:?msg} | Value of VAR if set and non-empty, otherwise raises with msg. |

The three env: forms differ in how they treat an empty-string value of VAR, matching the corresponding POSIX shell parameter expansions:

  • ${env:VAR} accepts empty strings (only unset raises). Use this when an empty value is a meaningful configuration.
  • ${env:VAR:-default} and ${env:VAR:?msg} treat unset and empty identically. Prefer these when you want "missing or blank" to be one case. default may itself be empty: ${env:VAR:-} renders to the empty string when VAR is unset or blank.
work_dir: ./work/${name}-${run_id}              # → ./work/my-pipeline-run-20260415-a1b2c3
num_cpu_workers: ${env:WORKERS:-8}              # 8 unless WORKERS is set to a non-empty value
# pyannote_diarize wants a HuggingFace token; surface a clear error if missing.
stages:
  - name: diarize
    op: pyannote_diarize
    args:
      hf_token: ${env:HF_TOKEN:?set HF_TOKEN in ./.env}

A } cannot appear inside a default or error message — the parser stops at the first } character.
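
The substitution rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not VoxKitchen's parser: interpolate, the exception types, and the variable-name pattern are hypothetical. The regular expression ends each token at the first }, which reproduces the limitation just described.

```python
import re

# Hypothetical sketch of the rules in this section, not VoxKitchen's code.
_TOKEN = re.compile(r"\$\{([^}]*)\}")  # token body ends at the first "}"
_ENV = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)(?::([-?])(.*))?", re.DOTALL)


def interpolate(text, variables, env):
    """Expand ${name}, ${run_id} and the three ${env:...} forms in text."""
    def expand(match):
        body = match.group(1)
        if not body.startswith("env:"):
            return variables[body]              # e.g. name, run_id
        parsed = _ENV.fullmatch(body[len("env:"):])
        if parsed is None:
            raise ValueError(f"malformed token: {match.group(0)}")
        var, op, arg = parsed.groups()
        value = env.get(var)
        if op is None:                          # ${env:VAR}
            if value is None:
                raise KeyError(f"environment variable {var} is not set")
            return value                        # empty string passes through
        if value:                               # set and non-empty
            return value
        if op == "-":                           # ${env:VAR:-default}
            return arg
        raise ValueError(arg or f"{var} is unset or empty")  # ${env:VAR:?msg}

    return _TOKEN.sub(expand, text)
```

Passing the environment as an explicit mapping (for example dict(os.environ)) keeps the function easy to test; a token such as ${env:X:-a}b} expands only up to the first }, leaving b} as literal text.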

Ingest Sources

dir — Scan a directory for audio files

ingest:
  source: dir
  args:
    root: /path/to/audio          # Required.
    recursive: true               # Optional (default: true).

manifest — Load a pre-built CutSet

ingest:
  source: manifest
  args:
    path: /path/to/cuts.jsonl.gz  # Required.

recipe — Use a dataset recipe

ingest:
  source: recipe
  recipe: librispeech             # Recipe name.
  args:
    root: /path/to/librispeech    # Required.
    subsets: [train-clean-100]    # Optional. Default: all subsets.

Available recipes: librispeech, aishell, commonvoice, fleurs
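
The per-source requirements above can be expressed as a small lookup table. The Python sketch below is illustrative only: check_ingest is a hypothetical helper, the input is assumed to be the parsed ingest block as a plain dict, and it assumes every recipe requires root, which this page shows only for librispeech.

```python
# Hypothetical helper, not part of VoxKitchen.
KNOWN_RECIPES = {"librispeech", "aishell", "commonvoice", "fleurs"}
# Assumption: "root" is required for every recipe, as shown for librispeech.
REQUIRED_ARGS = {"dir": ["root"], "manifest": ["path"], "recipe": ["root"]}


def check_ingest(ingest):
    """Return a list of problems found in a parsed ingest block."""
    source = ingest.get("source")
    if source not in REQUIRED_ARGS:
        return [f"unknown ingest source: {source!r}"]

    problems = []
    args = ingest.get("args") or {}
    for key in REQUIRED_ARGS[source]:
        if key not in args:
            problems.append(f"ingest.args.{key} is required for source={source}")
    if source == "recipe" and ingest.get("recipe") not in KNOWN_RECIPES:
        problems.append(f"unknown recipe: {ingest.get('recipe')!r}")
    return problems
```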

Pipeline Execution

Resume

Pipelines checkpoint after each stage. If a run crashes, re-running the same command skips the stages that already completed and continues with the first unfinished one:

vkit docker run pipeline.yaml                   # Auto-resume
vkit docker run pipeline.yaml --resume-from asr # Force resume from specific stage

Partial Execution

vkit docker run pipeline.yaml --stop-at vad     # Stop after VAD stage
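
To make the interaction between auto-resume, --resume-from, and --stop-at concrete, here is a minimal Python sketch of how the set of stages to execute could be chosen. It is not the actual executor: stages_to_run is hypothetical, and the layout <work_dir>/<stage name>/_SUCCESS is an assumption made for the example (this page only states that each stage writes a _SUCCESS marker).

```python
from pathlib import Path


def stages_to_run(stage_names, work_dir, resume_from=None, stop_at=None):
    """Pick the slice of stages a run would execute (illustrative only)."""
    work_dir = Path(work_dir)
    if resume_from is not None:
        # Forced resume point: start here even if later markers exist.
        start = stage_names.index(resume_from)
    else:
        # Auto-resume: skip the leading run of stages that already finished.
        # Assumed layout: <work_dir>/<stage name>/_SUCCESS
        start = 0
        while (start < len(stage_names)
               and (work_dir / stage_names[start] / "_SUCCESS").exists()):
            start += 1
    # --stop-at is inclusive: the named stage still runs.
    end = len(stage_names) if stop_at is None else stage_names.index(stop_at) + 1
    return stage_names[start:end]
```

With stages vad and asr and a marker present only for vad, the sketch returns just asr onward; adding stop_at="asr" trims everything after it.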

Garbage Collection

By default (gc_mode: aggressive), intermediate audio files are cleaned up after downstream stages finish. Use --keep-intermediates to preserve all derived audio.

Pre-flight Validation

A static pre-flight check runs over the stage chain before any data is processed. It runs in vkit validate, in vkit run and vkit docker run (as a fail-fast gate ahead of the executor), and in --dry-run mode.

Pre-flight seeds an available-field set from the ingest source:

  • dir starts with just audio (plus custom.reference_text when a reference_text_glob is set).
  • recipe adds supervision fields.
  • manifest additionally assumes the metrics.* / custom.* namespaces may be present, since it loads a previously built CutSet.

It then walks each stage in order, consulting each operator's field contract.

| Outcome | Trigger | Effect |
|---|---|---|
| ERROR | A stage's reads (or dynamic_reads) reference a field no upstream stage produces. | Printed with stage name and missing field; exits with code 1. |
| WARNING | A stage's optional_reads field is absent from the available set. | Printed; pipeline is not blocked (the stage will skip or degrade gracefully). |

After each stage, its writes tokens are added to the available set and its clears tokens are removed.
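
The walk described above amounts to a single pass over the stages with a running set of field names. The Python sketch below shows that idea only; it is not the real implementation. Contract and preflight are hypothetical names, dynamic_reads are assumed to be merged into reads by the caller, wildcard namespaces such as metrics.* are not handled, and the warning wording is invented. The error string follows the message format documented on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    """Hypothetical stand-in for an operator's field contract."""
    reads: set = field(default_factory=set)
    optional_reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    clears: set = field(default_factory=set)


def preflight(seed_fields, stages):
    """stages is a list of (stage_name, op_name, Contract) tuples.

    Returns (errors, warnings) as lists of message strings.
    """
    available = set(seed_fields)
    errors, warnings = [], []
    for name, op, contract in stages:
        for missing in sorted(contract.reads - available):
            errors.append(
                f"error: stage '{name}' (op '{op}') requires '{missing}' "
                "but no upstream stage produces it")
        for missing in sorted(contract.optional_reads - available):
            # Invented wording; only the behaviour (warn, do not block) is documented.
            warnings.append(
                f"warning: stage '{name}' (op '{op}') optional field "
                f"'{missing}' is not available")
        # After the stage: add what it writes, drop what it clears.
        available |= contract.writes
        available -= contract.clears
    return errors, warnings
```

A caller would exit with code 1 when errors is non-empty and print warnings without blocking the run.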

Example error — filtering on a metric nothing produces:

stages:
  - name: vad
    op: silero_vad
  - name: filter
    op: quality_score_filter
    args:
      conditions: ["metrics.snr > 20"]   # ERROR: metrics.snr never written

Pre-flight reports:

error: stage 'filter' (op 'quality_score_filter') requires 'metrics.snr' but no upstream stage produces it

Fix: add the producing stage before the filter:

stages:
  - name: vad
    op: silero_vad
  - name: snr
    op: snr_estimate          # writes metrics.snr
  - name: filter
    op: quality_score_filter
    args:
      conditions: ["metrics.snr > 20"]   # OK

Skipping pre-flight: pass --no-preflight to vkit validate, vkit run --dry-run, or vkit docker run if you need to bypass the check (e.g., while iterating on a partial pipeline). vkit docker run --no-preflight forwards the flag to the in-container run.

Pre-flight uses the live operator registry where possible, falling back to op_schemas.json (bundled inside Docker images) for operators not importable in the current environment.

Stage Execution

Stages execute sequentially. Each stage:

  1. Receives the CutSet from the previous stage
  2. Splits it across CPU/GPU workers when the operator is shard-safe
  3. Runs the operator on each shard
  4. Merges results and writes cuts.jsonl.gz + _SUCCESS marker + _stats.json

Failed cuts in shard-safe stages are logged to _errors.jsonl and skipped, so the pipeline continues. Batch exporters such as pack_huggingface run once over the whole CutSet and fail atomically to avoid partial output directories.
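
The split, run, and merge steps for a shard-safe stage, together with the skip-and-log handling of failed cuts, can be sketched as follows. This is a simplified Python illustration, not VoxKitchen's executor: run_shard_safe_stage is hypothetical, cuts are represented by an ordinary list, a thread pool stands in for the CPU/GPU workers, and the error records are returned rather than written to _errors.jsonl.

```python
from concurrent.futures import ThreadPoolExecutor


def run_shard_safe_stage(cuts, operator, num_workers=2):
    """Apply operator to every cut; return (results, errors)."""
    # Split into contiguous shards so merged output keeps the input order.
    size = max(1, -(-len(cuts) // num_workers))  # ceiling division
    shards = [cuts[i:i + size] for i in range(0, len(cuts), size)]

    def run_one(shard):
        done, failed = [], []
        for cut in shard:
            try:
                done.append(operator(cut))
            except Exception as exc:
                # A failed cut is recorded and skipped; the shard continues.
                failed.append({"cut": cut, "error": str(exc)})
        return done, failed

    results, errors = [], []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields shard results in submission order.
        for done, failed in pool.map(run_one, shards):
            results.extend(done)
            errors.extend(failed)
    return results, errors
```

A batch exporter would skip the sharding entirely and let any exception propagate, which is what makes its failure atomic.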