Operator Reference

VoxKitchen ships with 52 built-in operators across 8 categories.

Tip

Run vkit operators to see this list in your terminal, or vkit operators show <name> for details.

Audio Processing

`channel_merge`

Merge multi-channel audio into mono or a specified number of channels.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`target_channels`	int	`1`

- name: my_channel_merge
  op: channel_merge
  args:
    target_channels: 1

`ffmpeg_convert`

Convert audio format using ffmpeg (e.g. opus to wav, flac to mp3).

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`target_format`	str	`wav`
`clean_names`	bool	`True`

- name: my_ffmpeg_convert
  op: ffmpeg_convert
  args:
    target_format: wav
    clean_names: true

`loudness_normalize`

Normalize audio loudness to a target LUFS level (EBU R 128).

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`target_lufs`	float	`-23.0`

- name: my_loudness_normalize
  op: loudness_normalize
  args:
    target_lufs: -23.0

`resample`

Resample audio to a target sample rate and channel count.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`target_sr`	int	required
`target_channels`	int \| None	`None`

- name: my_resample
  op: resample
  args:
    target_sr: <int>
    target_channels: null

Segmentation

`fixed_segment`

Split each input Cut into fixed-length child Cuts.

This is a 1-to-many operator: one Cut in, N Cuts out. Each child shares the parent's recording and recording_id — no new audio is written. The child's start is offset within the parent's audio file, so playback of child.recording from child.start for child.duration seconds yields the correct audio slice.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`segment_duration`	float	`10.0`
`min_remaining`	float	`0.5`

- name: my_fixed_segment
  op: fixed_segment
  args:
    segment_duration: 10.0
    min_remaining: 0.5

`silence_split`

Split each Cut on silent regions using librosa.effects.split.

Returns one child Cut per non-silent interval. No new audio is written.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`top_db`	int	`30`
`min_duration`	float	`0.1`

- name: my_silence_split
  op: silence_split
  args:
    top_db: 30
    min_duration: 0.1

`silero_vad`

Detect speech regions using Silero VAD and emit one child Cut per region.

Loads the Silero VAD model via torch.hub (cached after first download). Works on both GPU and CPU. Requires network on first run to download the model (~2 MB). Use webrtc_vad or silence_split if torch is not available.

Device: gpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`threshold`	float	`0.5`
`min_speech_duration_ms`	int	`250`
`min_silence_duration_ms`	int	`100`
`speech_pad_ms`	int	`30`

- name: my_silero_vad
  op: silero_vad
  args:
    threshold: 0.5
    min_speech_duration_ms: 250
    min_silence_duration_ms: 100
    speech_pad_ms: 30

`webrtc_vad`

Detect speech regions using webrtcvad and emit one child Cut per region.

Reads audio bytes from the parent Cut, runs frame-by-frame VAD, merges consecutive speech frames, applies minimum-duration and padding, then creates child Cuts for each speech region. No new audio is written.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`aggressiveness`	int	`2`
`frame_duration_ms`	int	`30`
`min_speech_duration_ms`	int	`250`
`padding_ms`	int	`30`

- name: my_webrtc_vad
  op: webrtc_vad
  args:
    aggressiveness: 2
    frame_duration_ms: 30
    min_speech_duration_ms: 250
    padding_ms: 30

Data Augmentation

`noise_augment`

Mix audio with random noise files at a random SNR.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`noise_dir`	str	required
`snr_range`	list[float]	`[5.0, 20.0]`

- name: my_noise_augment
  op: noise_augment
  args:
    noise_dir: <str>
    snr_range: [5.0, 20.0]

`reverb_augment`

Add room reverb by convolving with Room Impulse Response files.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`rir_dir`	str	required
`normalize`	bool	`True`

- name: my_reverb_augment
  op: reverb_augment
  args:
    rir_dir: <str>
    normalize: true

`speed_perturb`

Apply speed perturbation (tempo + pitch change) via resampling.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`factors`	list[float]	`[0.9, 1.0, 1.1]`

- name: my_speed_perturb
  op: speed_perturb
  args:
    factors: [0.9, 1.0, 1.1]

`volume_perturb`

Apply random volume gain within a dB range.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`min_gain_db`	float	`-6.0`
`max_gain_db`	float	`6.0`

- name: my_volume_perturb
  op: volume_perturb
  args:
    min_gain_db: -6.0
    max_gain_db: 6.0

Annotation

`codec_tokenize`

Encode audio into discrete codec tokens (EnCodec / DAC).

Device: gpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`backend`	str	`encodec`
`bandwidth`	float	`6.0`
`model`	str	`encodec_24khz`

- name: my_codec_tokenize
  op: codec_tokenize
  args:
    backend: encodec
    bandwidth: 6.0
    model: encodec_24khz

`emotion_recognize`

Recognize speech emotions using emotion2vec (9 classes: angry, happy, sad, ...).

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default	Description
`model`	str	`iic/emotion2vec_plus_large`
`granularity`	str	`utterance`

- name: my_emotion_recognize
  op: emotion_recognize
  args:
    model: iic/emotion2vec_plus_large
    granularity: utterance

`faster_whisper_asr`

Transcribe audio using faster-whisper and add Supervisions with text + language.

Uses CTranslate2 for inference. Has GPU and CPU, but on CPU the compute_type is coerced to "int8" because float16 is not supported.

Warning

CTranslate2 may deadlock on macOS ARM64. Use whisper_openai_asr on macOS instead.

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default
`model`	str	`tiny`
`language`	str \| None	`None`
`beam_size`	int	`5`
`compute_type`	str	`int8`
`cpu_threads`	int	`4`

- name: my_faster_whisper_asr
  op: faster_whisper_asr
  args:
    model: tiny
    language: null
    beam_size: 5
    compute_type: int8
    cpu_threads: 4

`forced_align`

Align text to audio at word level using Qwen3-ForcedAligner (11 languages).

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default	Description
`model`	str	`Qwen/Qwen3-ForcedAligner-0.6B`
`language`	str	`Chinese`

- name: my_forced_align
  op: forced_align
  args:
    model: Qwen/Qwen3-ForcedAligner-0.6B
    language: Chinese

`gender_classify`

Classify speaker gender using one of several methods.

Methods:

f0: Extract fundamental frequency via librosa's pyin. Male if median F0 < threshold (default 165 Hz), else female. Fast, no model download, but only ~80-85% accurate on clean adult speech. Fails on children, elderly, or noisy audio.
speechbrain: Use a SpeechBrain EncoderClassifier. More accurate (~95%+) but requires model download. Default model is a speaker recognition model (placeholder) — override speechbrain_model with a true gender classifier for best results.
inaspeechsegmenter: Use INA's speech segmenter which jointly detects speech/music/noise and classifies gender. Well-tested in broadcast media analysis (~90-95%). This backend is not included in the published Docker images because it pulls in TensorFlow.
Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`method`	str	`f0`
`f0_threshold`	float	`165.0`
`speechbrain_model`	str	`speechbrain/spkrec-ecapa-voxceleb`

- name: my_gender_classify
  op: gender_classify
  args:
    method: f0
    f0_threshold: 165.0
    speechbrain_model: speechbrain/spkrec-ecapa-voxceleb

`mel_extract`

Extract mel spectrogram and save as .npy file per cut.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`n_fft`	int	`1024`
`hop_length`	int	`256`
`n_mels`	int	`80`
`fmin`	float	`0.0`
`fmax`	float \| None	`8000.0`
`ref_db`	float	`20.0`
`output_dir`	str \| None	`None`

- name: my_mel_extract
  op: mel_extract
  args:
    n_fft: 1024
    hop_length: 256
    n_mels: 80
    fmin: 0.0
    fmax: 8000.0
    ref_db: 20.0
    output_dir: null

`normalize_text`

Normalize supervisions[].text in place: strip tags, collapse spaces.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`strip_tags`	bool	`True`
`collapse_spaces`	bool	`True`
`lowercase`	bool	`False`

- name: my_normalize_text
  op: normalize_text
  args:
    strip_tags: true
    collapse_spaces: true
    lowercase: false

`paraformer_asr`

Transcribe audio using Paraformer (FunASR).

The default model includes built-in VAD and punctuation restoration, making it suitable for long-form audio without pre-segmentation. Much faster than Whisper for Chinese.

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default	Description
`model`	str	`iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch`
`language`	str	`zh`

- name: my_paraformer_asr
  op: paraformer_asr
  args:
    model: iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch
    language: zh

`pyannote_diarize`

Add speaker-label Supervisions to each Cut using pyannote.audio.

Requires accepting the pyannote model user agreement on HuggingFace and setting HF_TOKEN (or passing hf_token in the config).

Device: gpu
Runtime: vkit docker run --tag diarize <yaml>
Produces audio: No

Parameter	Type	Default
`model`	str	`pyannote/speaker-diarization-3.1`
`min_speakers`	int \| None	`None`
`max_speakers`	int \| None	`None`
`hf_token`	str \| None	`None`

- name: my_pyannote_diarize
  op: pyannote_diarize
  args:
    model: pyannote/speaker-diarization-3.1
    min_speakers: null
    max_speakers: null
    hf_token: null

`qwen3_asr`

Transcribe audio using Qwen3-ASR.

30 languages + 22 Chinese dialects. Set return_timestamps=True to also get word-level timestamps (uses ForcedAligner internally).

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default
`model`	str	`Qwen/Qwen3-ASR-0.6B`
`language`	str \| None	`None`
`return_timestamps`	bool	`False`
`aligner_model`	str	`Qwen/Qwen3-ForcedAligner-0.6B`
`max_new_tokens`	int	`512`

- name: my_qwen3_asr
  op: qwen3_asr
  args:
    model: Qwen/Qwen3-ASR-0.6B
    language: null
    return_timestamps: false
    aligner_model: Qwen/Qwen3-ForcedAligner-0.6B
    max_new_tokens: 512

`sensevoice_asr`

Transcribe audio using SenseVoice (FunASR).

SenseVoice supports Chinese, English, Japanese, Korean, and Cantonese. The SenseVoiceSmall model is fast and accurate for these languages.

In addition to the transcript, each Supervision carries: - supervision.language — model-detected language code - supervision.custom["emotion"] — per-utterance emotion label - supervision.custom["audio_event"] — "Speech" / "BGM" / "noise"

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default	Description
`model`	str	`iic/SenseVoiceSmall`
`language`	str	`auto`

- name: my_sensevoice_asr
  op: sensevoice_asr
  args:
    model: iic/SenseVoiceSmall
    language: auto

`speaker_embed`

Extract speaker embedding vectors using SpeechBrain or WeSpeaker.

Official VoxKitchen Docker images support method="speechbrain". method="wespeaker" is kept for custom environments with WeSpeaker installed.

Device: gpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`method`	str	`speechbrain`
`wespeaker_model`	str	`english`
`speechbrain_model`	str	`speechbrain/spkrec-ecapa-voxceleb`

- name: my_speaker_embed
  op: speaker_embed
  args:
    method: speechbrain
    wespeaker_model: english
    speechbrain_model: speechbrain/spkrec-ecapa-voxceleb

`speech_enhance`

Remove background noise using DeepFilterNet neural denoiser.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`method`	str	`deepfilternet`
`aggressiveness`	float	`0.5`

- name: my_speech_enhance
  op: speech_enhance
  args:
    method: deepfilternet
    aggressiveness: 0.5

`speechbrain_langid`

Add a language-identification Supervision to each Cut using SpeechBrain.

Uses the VoxLingua107 ECAPA-TDNN model by default. Runs on CPU with automatic fallback from CUDA.

Device: gpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`model`	str	`speechbrain/lang-id-voxlingua107-ecapa`

- name: my_speechbrain_langid
  op: speechbrain_langid
  args:
    model: speechbrain/lang-id-voxlingua107-ecapa

`wenet_asr`

Transcribe audio using WeNet.

WeNet supports streaming and non-streaming decoding. This operator uses non-streaming (offline) mode for best accuracy.

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default	Description
`model`	str	`chinese`
`language`	str	`zh`

- name: my_wenet_asr
  op: wenet_asr
  args:
    model: chinese
    language: zh

`whisper_langid`

Detect the spoken language of each cut using Whisper.

Adds a Supervision with the detected language code (e.g., "en", "zh", "ja"). Uses only the first 30 seconds for detection — fast even on long recordings.

Backend selection (backend config): - auto: prefer faster-whisper, fall back to openai-whisper - openai: use openai-whisper (macOS-safe) - faster-whisper: use faster-whisper (faster on GPU)

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default	Description
`model`	str	`tiny`
`backend`	str	`auto`

- name: my_whisper_langid
  op: whisper_langid
  args:
    model: tiny
    backend: auto

`whisper_openai_asr`

Transcribe audio using OpenAI's official whisper (pure PyTorch).

Works on both CPU and GPU. On CPU, set fp16: false. Auto-detects CUDA and falls back to CPU transparently.

This is the recommended ASR operator for macOS where CTranslate2-based operators (faster_whisper_asr) may deadlock.

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default
`model`	str	`tiny`
`language`	str \| None	`None`
`beam_size`	int	`5`
`fp16`	bool	`True`

- name: my_whisper_openai_asr
  op: whisper_openai_asr
  args:
    model: tiny
    language: null
    beam_size: 5
    fp16: true

`whisperx_asr`

Transcribe audio with word-level alignment using whisperx.

If whisperx is not installed, falls back to faster-whisper at segment level (no word alignment). Both paths are packaged in the ASR Docker runtime.

Device: gpu
Runtime: vkit docker run --tag asr <yaml>
Produces audio: No

Parameter	Type	Default
`model`	str	`tiny`
`language`	str \| None	`None`
`batch_size`	int	`8`
`compute_type`	str	`int8`

- name: my_whisperx_asr
  op: whisperx_asr
  args:
    model: tiny
    language: null
    batch_size: 8
    compute_type: int8

Quality & Filtering

`audio_fingerprint_dedup`

Remove near-duplicate cuts using MFCC mean features + simhash.

For each cut, a 13-coefficient MFCC mean vector is computed and hashed with simhash. Cuts whose hash is within similarity_threshold bits (hamming distance) of any previously seen hash are dropped as duplicates.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`similarity_threshold`	int	`3`

- name: my_audio_fingerprint_dedup
  op: audio_fingerprint_dedup
  args:
    similarity_threshold: 3

`bandwidth_estimate`

Estimate effective audio bandwidth and store in metrics.

Detects files that were upsampled from a lower sample rate — e.g., an 8 kHz telephone recording saved as 48 kHz WAV will show bandwidth_khz ≈ 4.0.

Computes STFT, measures mean power per frequency bin, then finds the frequency where energy drops sharply (ratio method). Writes: - metrics["bandwidth_khz"]: effective bandwidth in kHz

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`nfft`	int	`512`
`hop`	int	`256`

- name: my_bandwidth_estimate
  op: bandwidth_estimate
  args:
    nfft: 512
    hop: 256

`cer_wer`

Compute CER and WER between ASR output and reference text.

Reference text is read from cut.custom[reference_field] (default key: "reference_text"). Cuts without a reference are passed through unchanged.

With normalize=True (default) both hypothesis and reference are normalized before comparison: - SenseVoice <|zh|><|HAPPY|>… tags are stripped - Paraformer's inter-character spaces are removed - Punctuation is discarded - Text is lowercased This makes CER directly comparable across different ASR backends.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`hypothesis_field`	str	`text`
`reference_field`	str	`reference_text`
`normalize`	bool	`True`

- name: my_cer_wer
  op: cer_wer
  args:
    hypothesis_field: text
    reference_field: reference_text
    normalize: true

`clipping_detect`

Detect audio clipping and store the ratio of clipped samples.

Clipping occurs when recording levels are too high, causing the waveform to be truncated at the maximum amplitude ceiling. This produces harsh distortion that degrades ASR and TTS training.

Writes metrics["clipping_ratio"] — fraction of samples whose absolute value exceeds ceiling (default 0.99). A ratio > 0.01 indicates significant clipping.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`ceiling`	float	`0.99`

- name: my_clipping_detect
  op: clipping_detect
  args:
    ceiling: 0.99

`dnsmos_score`

Score audio quality using Microsoft DNSMOS (no reference needed).

Writes four metrics: - dnsmos_ovrl — P.835 overall quality (1-5) - dnsmos_sig — P.835 speech signal quality (1-5) - dnsmos_bak — P.835 background noise quality (1-5) - dnsmos_p808 — P.808 overall MOS (1-5)

Higher is better. Typically dnsmos_ovrl > 3.0 is considered acceptable for training data.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`use_gpu`	bool	`False`

- name: my_dnsmos_score
  op: dnsmos_score
  args:
    use_gpu: false

`duration_filter`

Drop Cuts whose duration falls outside [min_duration, max_duration].

This is an N-to-fewer operator: no audio is read or written.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`min_duration`	float	`0.0`
`max_duration`	float \| None	`None`

- name: my_duration_filter
  op: duration_filter
  args:
    min_duration: 0.0
    max_duration: null

`pitch_stats`

Compute pitch (F0) statistics using PyWorld (dio + stonemask).

More accurate than librosa.pyin for speech. Writes: - metrics["pitch_mean"] — mean F0 in Hz (voiced frames only) - metrics["pitch_std"] — normalized std (0-1 range, pitch-independent)

A pitch_mean of 0 means no voiced frames were detected.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default
`f0_min`	float	`50.0`
`f0_max`	float	`2400.0`
`frame_period_ms`	float	`5.0`

- name: my_pitch_stats
  op: pitch_stats
  args:
    f0_min: 50.0
    f0_max: 2400.0
    frame_period_ms: 5.0

`quality_score_filter`

Drop Cuts that do not satisfy all conditions.

Each condition is a whitespace-separated triple field.path op value where op is one of >, >=, <, <=, ==, !=. All conditions are AND-ed together.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`conditions`	list[str]	required

- name: my_quality_score_filter
  op: quality_score_filter
  args:
    conditions: <list[str]>

`snr_estimate`

Estimate SNR via a peak-to-RMS ratio and store it in cut.metrics["snr"].

This is a rough proxy (not WADA-SNR or model-based) sufficient for v0.1. No audio is written; only the metrics dict is updated.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

- name: my_snr_estimate
  op: snr_estimate

`speaker_similarity`

Score speaker similarity against a reference embedding (cosine).

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`reference_path`	str	required
`embedding_key`	str	`speaker_embedding`

- name: my_speaker_similarity
  op: speaker_similarity
  args:
    reference_path: <str>
    embedding_key: speaker_embedding

`utmos_score`

Predict speech naturalness MOS using UTMOS (no reference needed).

Writes metrics["utmos"] — predicted MOS score (1-5). Higher is better. Scores > 4.0 indicate natural-sounding speech.

Useful for filtering synthetic/degraded audio from training data.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

- name: my_utmos_score
  op: utmos_score

Synthesis

`tts_chattts`

Synthesize conversational speech using ChatTTS.

Device: gpu
Runtime: vkit docker run --tag tts <yaml>
Produces audio: Yes

Parameter	Type	Default
`seed`	int \| None	`None`
`temperature`	float	`0.3`
`top_p`	float	`0.7`
`top_k`	int	`20`

- name: my_tts_chattts
  op: tts_chattts
  args:
    seed: null
    temperature: 0.3
    top_p: 0.7
    top_k: 20

`tts_cosyvoice`

Synthesize speech using CosyVoice2 with optional voice cloning.

Device: gpu
Runtime: vkit docker run --tag tts <yaml>
Produces audio: Yes

Parameter	Type	Default
`model_id`	str	`FunAudioLLM/CosyVoice2-0.5B`
`mode`	str	`sft`
`spk_id`	str	`default`
`reference_audio`	str \| None	`None`
`reference_text`	str \| None	`None`

- name: my_tts_cosyvoice
  op: tts_cosyvoice
  args:
    model_id: FunAudioLLM/CosyVoice2-0.5B
    mode: sft
    spk_id: default
    reference_audio: null
    reference_text: null

`tts_fish_speech`

Synthesize speech using Fish-Speech codec language model.

Device: gpu
Runtime: vkit docker run --tag fish-speech <yaml>
Produces audio: Yes

Parameter	Type	Default
`model_id`	str	`fishaudio/s2-pro`
`reference_audio`	str \| None	`None`
`reference_text`	str \| None	`None`
`max_new_tokens`	int	`1024`
`top_p`	float	`0.8`
`temperature`	float	`0.8`
`repetition_penalty`	float	`1.1`
`chunk_length`	int	`200`
`seed`	int \| None	`None`
`compile`	bool	`False`
`half`	bool	`False`

- name: my_tts_fish_speech
  op: tts_fish_speech
  args:
    model_id: fishaudio/s2-pro
    reference_audio: null
    reference_text: null
    max_new_tokens: 1024
    top_p: 0.8
    temperature: 0.8
    repetition_penalty: 1.1
    chunk_length: 200
    seed: null
    compile: false
    half: false

`tts_kokoro`

Synthesize speech from text using Kokoro TTS.

Device: cpu
Runtime: vkit docker run --tag tts <yaml>
Produces audio: Yes

Parameter	Type	Default
`voice`	str	`af_heart`
`lang_code`	str	`a`
`speed`	float	`1.0`

- name: my_tts_kokoro
  op: tts_kokoro
  args:
    voice: af_heart
    lang_code: a
    speed: 1.0

Output / Packing

`pack_huggingface`

Export CutSet as a HuggingFace Dataset with audio column.

Warning

Recent HuggingFace datasets versions may require torchcodec when decoding Audio rows into arrays. Install torchcodec in the training environment, or cast with Audio(decode=False) when reading metadata/embedded audio bytes only.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`output_dir`	str \| None	`None`

- name: my_pack_huggingface
  op: pack_huggingface
  args:
    output_dir: null

`pack_jsonl`

Write a flat JSONL manifest — one JSON object per line.

Fields: id, origin_id, start, end, duration, sample_rate, text, snr, gender (male/female/unknown), speaker, language.

start/end are the VAD segment boundaries in the original recording. origin_id traces back to the source filename.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`output_path`	str \| None	`None`

- name: my_pack_jsonl
  op: pack_jsonl
  args:
    output_path: null

`pack_kaldi`

Export CutSet in Kaldi format (wav.scp, text, utt2spk, spk2utt).

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`output_dir`	str \| None	`None`

- name: my_pack_kaldi
  op: pack_kaldi
  args:
    output_dir: null

`pack_manifest`

Write a flat manifest (cuts.jsonl.gz) with no audio export.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

- name: my_pack_manifest
  op: pack_manifest

`pack_parquet`

Export CutSet as Apache Parquet with audio file references.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

Parameter	Type	Default	Description
`output_dir`	str \| None	`None`

- name: my_pack_parquet
  op: pack_parquet
  args:
    output_dir: null

`pack_webdataset`

Export CutSet as WebDataset tar shards with embedded audio.

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: Yes

Parameter	Type	Default	Description
`output_dir`	str \| None	`None`
`shard_size`	int	`1000`

- name: my_pack_webdataset
  op: pack_webdataset
  args:
    output_dir: null
    shard_size: 1000

Utility

`identity`

Pass cuts through unchanged (no-op, useful for testing).

Device: cpu
Runtime: vkit docker run --tag slim <yaml>
Produces audio: No

- name: my_identity
  op: identity

Operator Reference

Categories

Audio Processing

channel_merge

ffmpeg_convert

loudness_normalize

resample

Segmentation

fixed_segment

silence_split

silero_vad

webrtc_vad

Data Augmentation

noise_augment

reverb_augment

speed_perturb

volume_perturb

Annotation

codec_tokenize

emotion_recognize

faster_whisper_asr

forced_align

gender_classify

mel_extract

normalize_text

paraformer_asr

pyannote_diarize

qwen3_asr

sensevoice_asr

speaker_embed

speech_enhance

speechbrain_langid

wenet_asr

whisper_langid

whisper_openai_asr

whisperx_asr