Examples & Use Cases

Use this page when you already know what kind of speech data task you want to run. Start with a template for normal projects; use the bundled example pipelines for quick checks, demos, and advanced operator combinations.

Quick Demo

The published Docker images include demo pipelines and demo audio, so you can try VoxKitchen without cloning the repository:

vkit docker run --tag slim examples/pipelines/demo-no-asr.yaml --dry-run
vkit docker run --tag slim examples/pipelines/demo-no-asr.yaml
vkit inspect run ./work/demo-no-asr

Use this path to check that Docker, mounts, checkpoints, reports, and inspect commands work on your machine.

Start From A Template

Templates are the recommended starting point for real projects because they create a local project directory with data/, pipeline.yaml, and a short README.

Goal	Command	Runtime image
Clean and filter raw speech audio	`vkit init my-cleaning --template cleaning`	`slim`
Build ASR training data	`vkit init my-asr --template asr`	`asr`
Analyze speakers and languages	`vkit init my-speakers --template speaker`	`latest`
Prepare TTS training data	`vkit init my-tts --template tts`	`asr`

Typical run:

cd my-asr
cp /path/to/audio/* data/
vkit docker run --tag asr pipeline.yaml --dry-run
vkit docker run --tag asr pipeline.yaml
vkit inspect run work/

Bundled Example Pipelines

These YAML files are available inside the published Docker images under examples/pipelines/. Clone the repository only if you want to inspect or edit the files locally.

Pipeline	Use case	Runtime image
`minimal.yaml`	Identity passthrough to check the runner	`slim`
`demo-no-asr.yaml`	Small CPU-friendly demo with bundled audio	`slim`
`demo-full.yaml`	Full demo with VAD, quality metrics, ASR, gender, filtering	`asr`
`dir-resample-pack.yaml`	Directory ingest, resample, normalize, Kaldi export	`slim`
`data-cleaning.yaml`	Quality metrics, dedup, filtering, JSONL export	`slim`
`asr-training-data.yaml`	VAD, augmentation, ASR labeling, HuggingFace export	`asr`
`librispeech-asr.yaml`	Recipe ingest from LibriSpeech, ASR, quality filter	`asr`
`qwen3-asr.yaml`	Qwen3-ASR transcription path	`asr`
`forced-align.yaml`	Word alignment for existing text/audio	`asr`
`speaker-analysis.yaml`	VAD, diarization, speaker/language annotations	`latest`
`speaker-embed.yaml`	Extract speaker embeddings from speech segments	`slim`
`tts-data-prep.yaml`	Clean, segment, transcribe, align, and pack TTS data	`asr`
`tts-synthesis.yaml`	Run a TTS synthesis operator	`tts`
`tts-speaker-filter.yaml`	Filter or inspect data by speaker metadata	`slim`
`speech-enhance.yaml`	Speech enhancement / denoising	`slim`
`augmentation.yaml`	Basic data augmentation flow	`slim`
`noise-augment.yaml`	Add noise augmentation from a noise directory	`slim`
`reverb-augment.yaml`	Add reverberation augmentation	`slim`
`emotion-recognize.yaml`	Emotion annotation	`asr`
`codec-tokenize.yaml`	Audio codec token extraction	`slim`
`fleurs-multilingual.yaml`	Multilingual recipe-style processing	`asr`

Validate before running when you are unsure about paths, arguments, or image choice:

vkit docker run --tag asr examples/pipelines/asr-training-data.yaml --dry-run

To edit these examples or write your own pipeline from scratch, see the Pipeline YAML reference.

Choosing The Right Image

Use the smallest image that contains the operators in your pipeline:

Image	Choose it for
`slim`	CPU-friendly prep, VAD, enhancement, quality checks, codec tokenization, packing
`asr`	ASR, forced alignment, and emotion annotation
`diarize`	Pyannote diarization only
`tts`	TTS engines except Fish-Speech
`fish-speech`	Fish-Speech isolated runtime
`latest`	Mixed pipelines across ASR, diarization, TTS, and Fish-Speech

vkit validate pipeline.yaml prints the recommended pull/run command for your pipeline.

Inspect Outputs

Most examples write checkpoints and reports under ./work; exporters usually write final datasets under ./output.

vkit inspect run work/
vkit inspect cuts work/<run>/<stage>/cuts.jsonl.gz
vkit inspect errors work/

For pack_huggingface, the audio column is embedded in the Arrow dataset. If your training code needs decoded arrays, install torchcodec; for metadata or custom decoding, read with datasets.Audio(decode=False).