System Architecture

Operators: 52 across 8 categories.

Overview

VoxKitchen is a Docker-first speech data pipeline toolkit. Users write a YAML pipeline, run vkit docker run, and get training-ready datasets with full provenance tracking.

Core metaphor: pipeline.yaml is a recipe, operators are cooking steps, ingest recipes are ingredient prep, pack is plating.

Target users: Speech researchers (ASR, TTS, speaker recognition, speech LLMs).

Key guarantees: Reproducible, resumable, inspectable pipelines. Every stage checkpoints to disk; crashes resume from the last completed stage.

Layered Architecture

CLI Layer           (cli/)           User-facing commands
    |
Pipeline Layer      (pipeline/)      Orchestration, execution, GC
    |
Operator Layer      (operators/)     52 built-in transformations
    |
Schema Layer        (schema/)        Pydantic v2 data models

Dependency rule: Each layer depends only on layers below it. operators/ imports from schema/ only; RunContext is imported under TYPE_CHECKING to avoid circular deps.

Data Model (Schema Layer)

All types are Pydantic v2 BaseModel subclasses with extra="forbid".

Recording

Physical audio resource. Immutable after creation.

Recording
  id: str
  sources: list[AudioSource]       # type: file|url|command
  sampling_rate: int
  num_samples: int
  duration: float
  num_channels: int
  checksum: str | None
  custom: dict[str, Any]

Supervision

Time-aligned annotation over a Recording. Fields are progressively filled: VAD creates supervisions without text; ASR adds text later.

Supervision
  id, recording_id, start, duration    # required
  text, language, speaker, gender      # optional (filled by operators)
  channel, age_range, custom           # optional

Cut

The unit flowing through pipelines. References a [start, start+duration) slice of a Recording plus overlapping Supervisions. Immutable -- operators produce new Cuts, never mutate existing ones.

Cut
  id, recording_id, start, duration    # identity
  recording: Recording | None         # embedded for audio operators
  supervisions: list[Supervision]
  metrics: dict[str, float]            # snr, cer, utmos, ...
  custom: dict[str, Any]               # embeddings, tokens, ...
  provenance: Provenance               # lineage tracking

CutSet

Collection of Cuts. Serialized as gzip-compressed JSONL (.jsonl.gz). First line is a header with schema version.

Operations: filter(), map(), split(n), merge(), lazy iteration.

Provenance

Every Cut records how it was produced:

Provenance
  source_cut_id: str | None       # parent Cut (None for ingested)
  generated_by: str                # e.g. "silero_vad"
  stage_name: str                  # e.g. "02_vad"
  created_at: datetime
  pipeline_run_id: str

Enables vkit inspect trace <cut-id> to reconstruct the full lineage chain.

Operator System

Base Contract

class Operator(ABC):
    name: ClassVar[str]                         # unique registry key
    config_cls: ClassVar[type[OperatorConfig]]   # Pydantic config model
    device: ClassVar[Literal["cpu", "gpu"]]      # execution target
    produces_audio: ClassVar[bool]               # creates new WAV files?
    reads_audio_bytes: ClassVar[bool]            # needs raw audio samples?
    required_extras: ClassVar[list[str]]         # pip extras needed

    def setup(self) -> None: ...       # load models (once per worker)
    def process(self, cuts: CutSet) -> CutSet: ...  # transform
    def teardown(self) -> None: ...    # release resources

Registration

Built-in operators: @register_operator decorator + import in operators/__init__.py. Third-party: entry_points group voxkitchen.operators (lazy discovery on first access).

Optional deps wrapped in try/except ImportError -- missing packages don't crash the core.

Operator Catalog (52 operators, 8 categories)

Category	Count	Operators
Audio	5	`resample`, `ffmpeg_convert`, `channel_merge`, `loudness_normalize`, `identity`
Segmentation	4	`silero_vad`, `webrtc_vad`, `fixed_segment`, `silence_split`
Augmentation	4	`speed_perturb`, `volume_perturb`, `noise_augment`, `reverb_augment`
Synthesize	4	`tts_kokoro`, `tts_chattts`, `tts_cosyvoice`, `tts_fish_speech`
Annotation	17	`faster_whisper_asr`, `whisper_openai_asr`, `whisperx_asr`, `paraformer_asr`, `sensevoice_asr`, `wenet_asr`, `qwen3_asr`, `pyannote_diarize`, `speechbrain_langid`, `whisper_langid`, `gender_classify`, `speaker_embed`, `speech_enhance`, `forced_align`, `emotion_recognize`, `codec_tokenize`, `mel_extract`
Quality	11	`snr_estimate`, `dnsmos_score`, `utmos_score`, `pitch_stats`, `clipping_detect`, `bandwidth_estimate`, `duration_filter`, `audio_fingerprint_dedup`, `quality_score_filter`, `speaker_similarity`, `cer_wer`
Pack	6	`pack_manifest`, `pack_jsonl`, `pack_huggingface`, `pack_webdataset`, `pack_parquet`, `pack_kaldi`

Field Contracts

Every operator declares four ClassVar lists that describe which Cut fields it reads and writes. The pipeline pre-flight validator uses these declarations to catch wiring errors before any data is processed.

ClassVar	Meaning
`reads`	Fields that must be present; stage errors out if absent.
`writes`	Fields this operator populates or updates.
`optional_reads`	Fields consumed when present; a warning is emitted if absent.
`clears`	Fields this operator removes (e.g. VAD re-segmentation resets supervisions).

For contracts that depend on config values, operators implement dynamic_reads(self) -> list[str] instead of (or in addition to) reads. Config is read via self.config inside the method. quality_score_filter uses this: it inspects the conditions list at pre-flight time and returns the metric tokens those conditions reference.

Field vocabulary — the recognised token set:

Token pattern	Maps to
`audio`	Raw audio samples (waveform access).
`supervisions.text`	`Supervision.text` across all supervisions.
`supervisions.language`	`Supervision.language`.
`supervisions.speaker`	`Supervision.speaker`.
`supervisions.gender`	`Supervision.gender`.
`metrics.<name>`	`Cut.metrics["<name>"]` (e.g. `metrics.snr`).
`custom.<key>`	`Cut.custom["<key>"]` (e.g. `custom.word_alignments`).
`metrics.` / `custom.`	Namespace wildcard, used in `clears`.

Intrinsic fields (duration, start, channel) are not tracked — every stage can rely on them unconditionally.

Examples: - snr_estimate: reads=[audio], writes=[metrics.snr] - faster_whisper_asr: reads=[audio], writes=[supervisions.text, supervisions.language] - forced_align: reads=[audio, supervisions.text], writes=[custom.word_alignments, custom.forced_align_model] - silero_vad: reads=[audio], clears=[supervisions.text, supervisions.language, supervisions.speaker, supervisions.gender, metrics.*]

Operator Patterns

Analysis operator (e.g., snr_estimate): reads audio, writes to metrics. produces_audio=False, reads_audio_bytes=True.

Audio-producing operator (e.g., resample, speech_enhance): reads audio, writes new WAV to derived/, creates new Recording + Cut. produces_audio=True, reads_audio_bytes=True.

Text-only operator (e.g., cer_wer): reads from supervision.text or custom. No audio access. reads_audio_bytes=False.

TTS synthesis operator (e.g., tts_kokoro): reads text from supervision.text, generates audio in derived/. produces_audio=True, reads_audio_bytes=False.

Pipeline Engine

YAML Spec

version: "0.1"
name: my-pipeline
work_dir: ./work/${name}-${run_id}     # variable interpolation
num_gpus: 1
num_cpu_workers: null                   # auto-detect
gc_mode: aggressive                     # aggressive | keep

ingest:
  source: dir | manifest | recipe
  args: { root: ./data }

stages:
  - name: resample
    op: resample
    args: { target_sr: 16000 }
  - name: vad
    op: silero_vad
    args: { threshold: 0.5 }

Supports ${name}, ${run_id}, ${env:VAR} interpolation in all string values.

Execution Flow

vkit docker run pipeline.yaml
  |
  v
[Docker wrapper] mounts data/work/output, selects image, calls image entrypoint
  |
  v
[Loader] YAML -> PipelineSpec (Pydantic validation + interpolation)
  |
  v
[Runner] Resume check -> find last completed stage
  |
  v
[Ingest] DirScan | Manifest | Recipe -> initial CutSet
  |
  v
[Stage Loop]
  For each stage:
    1. Instantiate operator + config
    2. Select executor (CPU pool or GPU pool)
    3. Shard CutSet across workers
    4. Workers: setup() -> process() -> teardown()
    5. Write cuts.jsonl.gz + _SUCCESS marker
    6. Run GC on expired derived audio
  |
  v
[Finalize] Generate report, empty trash

Work Directory Layout

work_dir/
  run.yaml                    # spec snapshot
  00_resample/
    cuts.jsonl.gz             # output manifest
    _SUCCESS                  # completion marker
    _errors.jsonl             # per-cut errors (if any)
    _stats.json               # timing, throughput
    derived/                  # new audio files (if produces_audio)
  01_vad/
    cuts.jsonl.gz
    _SUCCESS
    ...
  derived_trash/              # GC'd audio (emptied on success)

Executors

CpuPoolExecutor: multiprocessing.Pool with spawn context. Shards CutSet, runs operator per shard. Config passed as JSON (not pickled) for cross-process safety.

GpuPoolExecutor: Spawns N subprocesses, each pinned to one GPU via CUDA_VISIBLE_DEVICES=i before torch import. Operator sees cuda:0.

Operators with parallelizable = False run once over the full CutSet. This is used for batch exporters such as pack_huggingface, where multiple workers would otherwise write the same output directory.

Error handling: If a sharded stage fails, retries cut-by-cut. Bad cuts are logged to _errors.jsonl, and a clean rerun removes stale error files. Batch stages with parallelizable = False fail atomically instead of falling back to per-cut retries.

Resume & Checkpointing

A stage is complete iff both cuts.jsonl.gz and _SUCCESS exist. _SUCCESS is written atomically after the manifest is fully flushed.

vkit docker run pipeline.yaml                    # full run
vkit docker run pipeline.yaml --resume-from vad  # resume from stage

Garbage Collection

Static analysis builds a GC plan: for each produces_audio=True stage, find its last downstream consumer (reads_audio_bytes=True). After that consumer completes, move the producer's derived/ to trash. Trash emptied only on successful pipeline completion.

gc_mode: keep or --keep-intermediates disables GC.

Ingest Sources

Source	Input	Use Case
`dir`	Directory of audio files	Raw audio processing
`manifest`	Existing `cuts.jsonl.gz`	Resume / chain pipelines
`recipe`	Named dataset parser	Standard datasets

Built-in Recipes

librispeech -- LibriSpeech ASR corpus (English read audiobooks, 960h)
libritts -- LibriTTS, multi-speaker English TTS (LibriSpeech-derived, sentence-segmented + TTS-normalized)
ljspeech -- LJSpeech-1.1, single-speaker English TTS baseline (24h)
aishell -- AISHELL-1 Mandarin read ASR (170h)
aishell3 -- AISHELL-3 multi-speaker Mandarin TTS (218 speakers, 85h)
cnceleb -- CN-Celeb 1 Chinese speaker recognition (1000 speakers, 130k utts, 11 genres)
commonvoice -- Mozilla Common Voice (multilingual, manual download)
fleurs -- Google FLEURS multilingual (102 languages, ~12h/lang)
musan -- MUSAN augmentation source (~11 GB of noise / music / speech, non-transcribed)

Each recipe implements download() and prepare(root, subsets, ctx) -> CutSet.

CLI Commands

Host-recommended commands (the supported path for pipx install voxkitchen users):

Command	Purpose
`vkit init <path> [-t template]`	Scaffold a project directory
`vkit validate <yaml>`	Validate YAML; print recommended image
`vkit docker pull --tag <tag>`	Pull a prebuilt runtime image
`vkit docker run <yaml>`	Execute pipeline inside a prebuilt image
`vkit docker download <recipe>`	Download dataset inside a prebuilt image
`vkit docker doctor` / `vkit doctor`	Per-env operator availability report
`vkit docker shell`	Open an interactive bash inside an image
`vkit docker build [target]`	Build a Docker image locally

Browse and inspect (read-only, host-safe):

Command	Purpose
`vkit operators [--category <cat>]`	List operators, optionally filtered
`vkit operators search <keyword>`	Find operators by name or one-line summary
`vkit operators show <name>`	Operator detail (args, device, image hint)
`vkit recipes`	List dataset recipes
`vkit schema export [--out PATH]`	Generate `pipeline.schema.json` for editors
`vkit inspect run <dir>`	Stage summary for a run
`vkit inspect cuts <path>`	Cut statistics for a manifest
`vkit inspect trace <id> --in <dir>`	Provenance chain for a cut
`vkit inspect errors <dir>`	Per-stage error report
`vkit viz <manifest>`	Launch the Gradio explorer

Container / dev entrypoints (run inside an image, or with VKIT_ALLOW_LOCAL_RUN=1 for local debugging):

Command	Purpose
`vkit run <yaml>`	Pipeline entrypoint used inside the image
`vkit download <recipe>`	Current-env dataset download helper
`vkit ingest --source <dir\\|manifest\\|recipe>`	Standalone manifest builder

The three commands above warn when invoked from a bare host install, pointing to the recommended vkit docker … alternative.

Templates: tts, asr, cleaning, speaker (stored in voxkitchen/templates/pipelines/, with editable examples in examples/pipelines/).

Python Tools API

For one-off tasks without writing YAML:

from voxkitchen.tools import (
    transcribe, detect_speech, estimate_snr,
    extract_speaker_embedding, enhance_speech,
    align_words, synthesize,
)

transcribe("speech.wav", model="large-v3")
detect_speech("speech.wav", method="silero")
estimate_snr("speech.wav")
extract_speaker_embedding("speaker.wav")
enhance_speech("noisy.wav", "clean.wav")
align_words("speech.wav", "hello world")
synthesize("Hello!", "output.wav", engine="kokoro")

Each function creates a temporary Cut + RunContext, runs the corresponding operator, and returns the result.

Runtime Images

User pipeline execution is Docker-based. The host vkit command is a lightweight launcher; operator dependencies live in prebuilt images.

Image env groups:

Group	Packages	For
`audio`	torch, torchaudio	Resample, VAD
`asr`	faster-whisper	ASR transcription
`segment`	webrtcvad, librosa	Speech segmentation
`classify`	speechbrain	Speaker/language classifiers and speaker embeddings
`diarize`	pyannote.audio	Speaker diarization
`enhance`	deepfilternet	Speech denoising
`align`	qwen-asr	Forced alignment
`codec`	encodec, dac	Neural codec tokens
`tts-kokoro`	kokoro, misaki	Kokoro TTS (CPU)
`tts-chattts`	ChatTTS	ChatTTS (GPU)
`tts-cosyvoice`	modelscope	CosyVoice2 (GPU)
`tts-fish-speech`	fish-speech	Fish-Speech (GPU)
`viz`	jinja2, plotly	HTML report
`viz-panel`	gradio	Interactive panel

Project Structure

voxkitchen/
  cli/                  # Typer CLI app
  operators/            # 52 operators across 8 categories
    basic/              #   resample, ffmpeg_convert, ...
    segment/            #   silero_vad, webrtc_vad, ...
    augment/            #   speed_perturb, noise_augment, ...
    synthesize/         #   tts_kokoro, tts_chattts, ...
    annotate/           #   faster_whisper_asr, speaker_embed, ...
    quality/            #   snr_estimate, cer_wer, ...
    pack/               #   pack_jsonl, pack_huggingface, ...
    noop/               #   identity
  pipeline/             # Runner, executors, checkpoint, GC
  schema/               # Cut, CutSet, Recording, Supervision, Provenance
  ingest/               # DirScan, Manifest, Recipe sources
    recipes/            #   librispeech, libritts, ljspeech, aishell, aishell3, cnceleb, commonvoice, fleurs, musan
  viz/                  # HTML report + Gradio panel
  templates/            # vkit init template registry
  plugins/              # Entry-point discovery
  utils/                # Audio I/O, time, download helpers
  tools.py              # Standalone tool functions

examples/pipelines/     # 20+ example YAML pipelines
tests/unit/             # 297+ unit tests
tests/integration/      # End-to-end pipeline tests

Design Principles

Immutability -- Operators create new Cuts, never mutate existing ones
Resumability -- All state on disk; _SUCCESS markers enable crash recovery
Error tolerance -- Bad Cuts logged and skipped, pipeline continues
Docker-first runtime -- Heavy deps (torch, transformers) live in images
Provenance -- Every Cut tracks its lineage
Declarative first -- YAML is the primary interface; Python API is the escape hatch
Simplicity over efficiency -- Resolve conflicts in favor of simplicity

Roadmap

Completed

Core framework (schema, pipeline engine, CLI)
52 operators across 8 categories
9 ingest recipes (LibriSpeech, LibriTTS, LJSpeech, AISHELL-1, AISHELL-3, CN-Celeb 1, CommonVoice, FLEURS, MUSAN)
Visualization (Rich CLI, HTML report, Gradio panel)
Plugin system (entry_points)
TTS synthesis (Kokoro, ChatTTS, CosyVoice2, Fish-Speech)

Planned

Distributed execution -- Ray/Dask backend for multi-node pipelines
Cloud storage -- S3/GCS as audio source and output target
Dataset versioning -- DVC integration for manifest version control
Additional recipes -- GigaSpeech, WenetSpeech, MLS, VoxCeleb
Streaming pipelines -- Process audio streams without full materialization
Training integration -- Direct export to training frameworks (NeMo, ESPnet, WeNet)