Examples & Use Cases

Use this page when you already know what kind of speech data task you want to run. Start with a template for normal projects; use the bundled example pipelines for quick checks, demos, and advanced operator combinations.

Quick Demo

The published Docker images include demo pipelines and demo audio, so you can try VoxKitchen without cloning the repository:

vkit docker run --tag slim examples/pipelines/demo-no-asr.yaml --dry-run
vkit docker run --tag slim examples/pipelines/demo-no-asr.yaml
vkit inspect run ./work/demo-no-asr

Use this sequence to confirm that Docker, mounts, checkpoints, reports, and the inspect commands all work on your machine.

Start From A Template

Templates are the recommended starting point for real projects because they create a local project directory with data/, pipeline.yaml, and a short README.

| Goal | Command | Runtime image |
|---|---|---|
| Clean and filter raw speech audio | vkit init my-cleaning --template cleaning | slim |
| Build ASR training data | vkit init my-asr --template asr | asr |
| Analyze speakers and languages | vkit init my-speakers --template speaker | latest |
| Prepare TTS training data | vkit init my-tts --template tts | asr |

Typical run:

cd my-asr
cp /path/to/audio/* data/
vkit docker run --tag asr pipeline.yaml --dry-run
vkit docker run --tag asr pipeline.yaml
vkit inspect run work/

Bundled Example Pipelines

These YAML files are available inside the published Docker images under examples/pipelines/. Clone the repository only if you want to inspect or edit the files locally.

| Pipeline | Use case | Runtime image |
|---|---|---|
| minimal.yaml | Identity passthrough to check the runner | slim |
| demo-no-asr.yaml | Small CPU-friendly demo with bundled audio | slim |
| demo-full.yaml | Full demo with VAD, quality metrics, ASR, gender, filtering | asr |
| dir-resample-pack.yaml | Directory ingest, resample, normalize, Kaldi export | slim |
| data-cleaning.yaml | Quality metrics, dedup, filtering, JSONL export | slim |
| asr-training-data.yaml | VAD, augmentation, ASR labeling, HuggingFace export | asr |
| librispeech-asr.yaml | Recipe ingest from LibriSpeech, ASR, quality filter | asr |
| qwen3-asr.yaml | Qwen3-ASR transcription path | asr |
| forced-align.yaml | Word alignment for existing text/audio | asr |
| speaker-analysis.yaml | VAD, diarization, speaker/language annotations | latest |
| speaker-embed.yaml | Extract speaker embeddings from speech segments | slim |
| tts-data-prep.yaml | Clean, segment, transcribe, align, and pack TTS data | asr |
| tts-synthesis.yaml | Run a TTS synthesis operator | tts |
| tts-speaker-filter.yaml | Filter or inspect data by speaker metadata | slim |
| speech-enhance.yaml | Speech enhancement / denoising | slim |
| augmentation.yaml | Basic data augmentation flow | slim |
| noise-augment.yaml | Add noise augmentation from a noise directory | slim |
| reverb-augment.yaml | Add reverberation augmentation | slim |
| emotion-recognize.yaml | Emotion annotation | asr |
| codec-tokenize.yaml | Audio codec token extraction | slim |
| fleurs-multilingual.yaml | Multilingual recipe-style processing | asr |

Validate before running when you are unsure about paths, arguments, or image choice:

vkit docker run --tag asr examples/pipelines/asr-training-data.yaml --dry-run
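If you script your runs, the validate-then-run habit can be enforced in a small wrapper. The following is a minimal sketch, not part of VoxKitchen: the helper names `vkit_commands` and `validate_then_run` are hypothetical, and it assumes that a failed dry run exits with a non-zero status. The command arguments are the ones shown on this page.

```python
import subprocess


def vkit_commands(tag: str, pipeline: str) -> list[list[str]]:
    """Return the dry-run command followed by the real run command."""
    base = ["vkit", "docker", "run", "--tag", tag, pipeline]
    return [base + ["--dry-run"], base]


def validate_then_run(tag: str, pipeline: str) -> None:
    """Run the dry run first; start the real run only if it succeeds."""
    dry_run, real_run = vkit_commands(tag, pipeline)
    # check=True raises CalledProcessError on a non-zero exit status.
    # Assumption: a failed dry run exits non-zero.
    subprocess.run(dry_run, check=True)
    subprocess.run(real_run, check=True)
```

Calling `validate_then_run("asr", "examples/pipelines/asr-training-data.yaml")` would then stop before the real run if the dry run reports a problem.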

To edit these examples or write your own pipeline from scratch, see the Pipeline YAML reference.

Choosing The Right Image

Use the smallest image that contains the operators in your pipeline:

| Image | Choose it for |
|---|---|
| slim | CPU-friendly prep, VAD, enhancement, quality checks, codec tokenization, packing |
| asr | ASR, forced alignment, and emotion annotation |
| diarize | Pyannote diarization only |
| tts | TTS engines except Fish-Speech |
| fish-speech | Fish-Speech isolated runtime |
| latest | Mixed pipelines across ASR, diarization, TTS, and Fish-Speech |

vkit validate pipeline.yaml prints the recommended pull/run command for your pipeline.

Inspect Outputs

Most examples write checkpoints and reports under ./work; exporters usually write final datasets under ./output.

vkit inspect run work/
vkit inspect cuts work/<run>/<stage>/cuts.jsonl.gz
vkit inspect errors work/
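If you want to look at a cuts file from your own code rather than through the CLI, the sketch below reads a gzip-compressed JSONL file with the Python standard library. It assumes each line is one JSON object; the `duration` field name is an assumption about the record schema, and the function names are hypothetical. Records without that field are still counted.

```python
import gzip
import json
from collections import Counter
from pathlib import Path


def read_cuts(path):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(Path(path), "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(cuts, duration_key="duration"):
    """Count records, tally top-level keys, and sum durations when present."""
    count = 0
    total = 0.0
    keys = Counter()
    for cut in cuts:
        count += 1
        keys.update(cut.keys())
        value = cut.get(duration_key)  # assumed field name; may be absent
        if isinstance(value, (int, float)):
            total += value
    return {"count": count, "total_duration": total, "keys": dict(keys)}
```

For example, `summarize(read_cuts("work/<run>/<stage>/cuts.jsonl.gz"))` returns the record count together with the keys that actually appear, which is a quick way to see what a stage wrote before relying on any particular field.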

For pack_huggingface, the audio column is embedded in the Arrow dataset. If your training code needs decoded arrays, install torchcodec; for metadata-only access or custom decoding, cast the column to datasets.Audio(decode=False) so that each row returns the encoded audio instead of a decoded array.
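The undecoded route can be sketched as follows. This assumes the exporter wrote the dataset with `save_to_disk()`, so that `datasets.load_from_disk` can open it, and that the column is named `audio`; both are assumptions to check against your output directory. The helper names are hypothetical. With `decode=False`, each audio entry is a dict with `bytes` and `path` keys.

```python
from pathlib import Path


def load_undecoded(dataset_dir, column="audio"):
    """Load a saved dataset and disable audio decoding for one column."""
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import Audio, load_from_disk

    # Assumption: the exporter wrote the dataset with save_to_disk().
    dataset = load_from_disk(dataset_dir)
    return dataset.cast_column(column, Audio(decode=False))


def raw_audio_bytes(entry):
    """Return encoded audio bytes from an undecoded {"bytes", "path"} entry."""
    if entry.get("bytes") is not None:
        return entry["bytes"]
    if entry.get("path"):
        # Fall back to reading the file when the bytes are not embedded.
        return Path(entry["path"]).read_bytes()
    raise ValueError("entry has neither embedded bytes nor a path")
```

The bytes returned by `raw_audio_bytes(row["audio"])` are still in their encoded container format, so you can pass them to whichever decoder your training code already uses.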