Operator Reference
VoxKitchen ships with 52 built-in operators across 8 categories.
Tip
Run vkit operators to see this list in your terminal, or vkit operators show <name> for details.
Categories
- Audio Processing (4 operators)
- Segmentation (4 operators)
- Data Augmentation (4 operators)
- Annotation (18 operators)
- Quality & Filtering (11 operators)
- Synthesis (4 operators)
- Output / Packing (6 operators)
- Utility (1 operator)
Audio Processing
channel_merge
Merge multi-channel audio into mono or a specified number of channels.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
target_channels |
int | 1 |
- name: my_channel_merge
op: channel_merge
args:
target_channels: 1
ffmpeg_convert
Convert audio format using ffmpeg (e.g. opus to wav, flac to mp3).
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
target_format |
str | wav |
|
clean_names |
bool | True |
- name: my_ffmpeg_convert
op: ffmpeg_convert
args:
target_format: wav
clean_names: true
loudness_normalize
Normalize audio loudness to a target LUFS level (EBU R 128).
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
target_lufs |
float | -23.0 |
- name: my_loudness_normalize
op: loudness_normalize
args:
target_lufs: -23.0
resample
Resample audio to a target sample rate and channel count.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
target_sr |
int | required | |
target_channels |
int | None | None |
- name: my_resample
op: resample
args:
target_sr: <int>
target_channels: null
Segmentation
fixed_segment
Split each input Cut into fixed-length child Cuts.
This is a 1-to-many operator: one Cut in, N Cuts out. Each child shares
the parent's recording and recording_id — no new audio is written.
The child's start is offset within the parent's audio file, so playback
of child.recording from child.start for child.duration seconds
yields the correct audio slice.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
segment_duration |
float | 10.0 |
|
min_remaining |
float | 0.5 |
- name: my_fixed_segment
op: fixed_segment
args:
segment_duration: 10.0
min_remaining: 0.5
silence_split
Split each Cut on silent regions using librosa.effects.split.
Returns one child Cut per non-silent interval. No new audio is written.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
top_db |
int | 30 |
|
min_duration |
float | 0.1 |
- name: my_silence_split
op: silence_split
args:
top_db: 30
min_duration: 0.1
silero_vad
Detect speech regions using Silero VAD and emit one child Cut per region.
Loads the Silero VAD model via torch.hub (cached after first download).
Works on both GPU and CPU. Requires network on first run to download
the model (~2 MB). Use webrtc_vad or silence_split if torch
is not available.
- Device: gpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
threshold |
float | 0.5 |
|
min_speech_duration_ms |
int | 250 |
|
min_silence_duration_ms |
int | 100 |
|
speech_pad_ms |
int | 30 |
- name: my_silero_vad
op: silero_vad
args:
threshold: 0.5
min_speech_duration_ms: 250
min_silence_duration_ms: 100
speech_pad_ms: 30
webrtc_vad
Detect speech regions using webrtcvad and emit one child Cut per region.
Reads audio bytes from the parent Cut, runs frame-by-frame VAD, merges consecutive speech frames, applies minimum-duration and padding, then creates child Cuts for each speech region. No new audio is written.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
aggressiveness |
int | 2 |
|
frame_duration_ms |
int | 30 |
|
min_speech_duration_ms |
int | 250 |
|
padding_ms |
int | 30 |
- name: my_webrtc_vad
op: webrtc_vad
args:
aggressiveness: 2
frame_duration_ms: 30
min_speech_duration_ms: 250
padding_ms: 30
Data Augmentation
noise_augment
Mix audio with random noise files at a random SNR.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
noise_dir |
str | required | |
snr_range |
list[float] | [5.0, 20.0] |
- name: my_noise_augment
op: noise_augment
args:
noise_dir: <str>
snr_range: [5.0, 20.0]
reverb_augment
Add room reverb by convolving with Room Impulse Response files.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
rir_dir |
str | required | |
normalize |
bool | True |
- name: my_reverb_augment
op: reverb_augment
args:
rir_dir: <str>
normalize: true
speed_perturb
Apply speed perturbation (tempo + pitch change) via resampling.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
factors |
list[float] | [0.9, 1.0, 1.1] |
- name: my_speed_perturb
op: speed_perturb
args:
factors: [0.9, 1.0, 1.1]
volume_perturb
Apply random volume gain within a dB range.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
min_gain_db |
float | -6.0 |
|
max_gain_db |
float | 6.0 |
- name: my_volume_perturb
op: volume_perturb
args:
min_gain_db: -6.0
max_gain_db: 6.0
Annotation
codec_tokenize
Encode audio into discrete codec tokens (EnCodec / DAC).
- Device: gpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
backend |
str | encodec |
|
bandwidth |
float | 6.0 |
|
model |
str | encodec_24khz |
- name: my_codec_tokenize
op: codec_tokenize
args:
backend: encodec
bandwidth: 6.0
model: encodec_24khz
emotion_recognize
Recognize speech emotions using emotion2vec (9 classes: angry, happy, sad, ...).
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | iic/emotion2vec_plus_large |
|
granularity |
str | utterance |
- name: my_emotion_recognize
op: emotion_recognize
args:
model: iic/emotion2vec_plus_large
granularity: utterance
faster_whisper_asr
Transcribe audio using faster-whisper and add Supervisions with text + language.
Uses CTranslate2 for inference. Has GPU and CPU, but on CPU the compute_type is coerced to "int8" because float16 is not supported.
Warning
CTranslate2 may deadlock on macOS ARM64. Use whisper_openai_asr
on macOS instead.
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | tiny |
|
language |
str | None | None |
|
beam_size |
int | 5 |
|
compute_type |
str | int8 |
|
cpu_threads |
int | 4 |
- name: my_faster_whisper_asr
op: faster_whisper_asr
args:
model: tiny
language: null
beam_size: 5
compute_type: int8
cpu_threads: 4
forced_align
Align text to audio at word level using Qwen3-ForcedAligner (11 languages).
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | Qwen/Qwen3-ForcedAligner-0.6B |
|
language |
str | Chinese |
- name: my_forced_align
op: forced_align
args:
model: Qwen/Qwen3-ForcedAligner-0.6B
language: Chinese
gender_classify
Classify speaker gender using one of several methods.
Methods:
-
f0: Extract fundamental frequency via librosa's pyin. Male if median F0 < threshold (default 165 Hz), else female. Fast, no model download, but only ~80-85% accurate on clean adult speech. Fails on children, elderly, or noisy audio. -
speechbrain: Use a SpeechBrain EncoderClassifier. More accurate (~95%+) but requires model download. Default model is a speaker recognition model (placeholder) — overridespeechbrain_modelwith a true gender classifier for best results. -
inaspeechsegmenter: Use INA's speech segmenter which jointly detects speech/music/noise and classifies gender. Well-tested in broadcast media analysis (~90-95%). This backend is not included in the published Docker images because it pulls in TensorFlow. -
Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
method |
str | f0 |
|
f0_threshold |
float | 165.0 |
|
speechbrain_model |
str | speechbrain/spkrec-ecapa-voxceleb |
- name: my_gender_classify
op: gender_classify
args:
method: f0
f0_threshold: 165.0
speechbrain_model: speechbrain/spkrec-ecapa-voxceleb
mel_extract
Extract mel spectrogram and save as .npy file per cut.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
n_fft |
int | 1024 |
|
hop_length |
int | 256 |
|
n_mels |
int | 80 |
|
fmin |
float | 0.0 |
|
fmax |
float | None | 8000.0 |
|
ref_db |
float | 20.0 |
|
output_dir |
str | None | None |
- name: my_mel_extract
op: mel_extract
args:
n_fft: 1024
hop_length: 256
n_mels: 80
fmin: 0.0
fmax: 8000.0
ref_db: 20.0
output_dir: null
normalize_text
Normalize supervisions[].text in place: strip tags, collapse spaces.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
strip_tags |
bool | True |
|
collapse_spaces |
bool | True |
|
lowercase |
bool | False |
- name: my_normalize_text
op: normalize_text
args:
strip_tags: true
collapse_spaces: true
lowercase: false
paraformer_asr
Transcribe audio using Paraformer (FunASR).
The default model includes built-in VAD and punctuation restoration, making it suitable for long-form audio without pre-segmentation. Much faster than Whisper for Chinese.
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch |
|
language |
str | zh |
- name: my_paraformer_asr
op: paraformer_asr
args:
model: iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch
language: zh
pyannote_diarize
Add speaker-label Supervisions to each Cut using pyannote.audio.
Requires accepting the pyannote model user agreement on HuggingFace and
setting HF_TOKEN (or passing hf_token in the config).
- Device: gpu
- Runtime:
vkit docker run --tag diarize <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | pyannote/speaker-diarization-3.1 |
|
min_speakers |
int | None | None |
|
max_speakers |
int | None | None |
|
hf_token |
str | None | None |
- name: my_pyannote_diarize
op: pyannote_diarize
args:
model: pyannote/speaker-diarization-3.1
min_speakers: null
max_speakers: null
hf_token: null
qwen3_asr
Transcribe audio using Qwen3-ASR.
30 languages + 22 Chinese dialects. Set return_timestamps=True
to also get word-level timestamps (uses ForcedAligner internally).
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | Qwen/Qwen3-ASR-0.6B |
|
language |
str | None | None |
|
return_timestamps |
bool | False |
|
aligner_model |
str | Qwen/Qwen3-ForcedAligner-0.6B |
|
max_new_tokens |
int | 512 |
- name: my_qwen3_asr
op: qwen3_asr
args:
model: Qwen/Qwen3-ASR-0.6B
language: null
return_timestamps: false
aligner_model: Qwen/Qwen3-ForcedAligner-0.6B
max_new_tokens: 512
sensevoice_asr
Transcribe audio using SenseVoice (FunASR).
SenseVoice supports Chinese, English, Japanese, Korean, and Cantonese.
The SenseVoiceSmall model is fast and accurate for these languages.
In addition to the transcript, each Supervision carries:
- supervision.language — model-detected language code
- supervision.custom["emotion"] — per-utterance emotion label
- supervision.custom["audio_event"] — "Speech" / "BGM" / "noise"
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | iic/SenseVoiceSmall |
|
language |
str | auto |
- name: my_sensevoice_asr
op: sensevoice_asr
args:
model: iic/SenseVoiceSmall
language: auto
speaker_embed
Extract speaker embedding vectors using SpeechBrain or WeSpeaker.
Official VoxKitchen Docker images support method="speechbrain".
method="wespeaker" is kept for custom environments with WeSpeaker
installed.
- Device: gpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
method |
str | speechbrain |
|
wespeaker_model |
str | english |
|
speechbrain_model |
str | speechbrain/spkrec-ecapa-voxceleb |
- name: my_speaker_embed
op: speaker_embed
args:
method: speechbrain
wespeaker_model: english
speechbrain_model: speechbrain/spkrec-ecapa-voxceleb
speech_enhance
Remove background noise using DeepFilterNet neural denoiser.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
method |
str | deepfilternet |
|
aggressiveness |
float | 0.5 |
- name: my_speech_enhance
op: speech_enhance
args:
method: deepfilternet
aggressiveness: 0.5
speechbrain_langid
Add a language-identification Supervision to each Cut using SpeechBrain.
Uses the VoxLingua107 ECAPA-TDNN model by default. Runs on CPU with automatic fallback from CUDA.
- Device: gpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | speechbrain/lang-id-voxlingua107-ecapa |
- name: my_speechbrain_langid
op: speechbrain_langid
args:
model: speechbrain/lang-id-voxlingua107-ecapa
wenet_asr
Transcribe audio using WeNet.
WeNet supports streaming and non-streaming decoding. This operator uses non-streaming (offline) mode for best accuracy.
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | chinese |
|
language |
str | zh |
- name: my_wenet_asr
op: wenet_asr
args:
model: chinese
language: zh
whisper_langid
Detect the spoken language of each cut using Whisper.
Adds a Supervision with the detected language code (e.g., "en",
"zh", "ja"). Uses only the first 30 seconds for detection — fast
even on long recordings.
Backend selection (backend config):
- auto: prefer faster-whisper, fall back to openai-whisper
- openai: use openai-whisper (macOS-safe)
- faster-whisper: use faster-whisper (faster on GPU)
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | tiny |
|
backend |
str | auto |
- name: my_whisper_langid
op: whisper_langid
args:
model: tiny
backend: auto
whisper_openai_asr
Transcribe audio using OpenAI's official whisper (pure PyTorch).
Works on both CPU and GPU. On CPU, set fp16: false. Auto-detects
CUDA and falls back to CPU transparently.
This is the recommended ASR operator for macOS where CTranslate2-based operators (faster_whisper_asr) may deadlock.
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | tiny |
|
language |
str | None | None |
|
beam_size |
int | 5 |
|
fp16 |
bool | True |
- name: my_whisper_openai_asr
op: whisper_openai_asr
args:
model: tiny
language: null
beam_size: 5
fp16: true
whisperx_asr
Transcribe audio with word-level alignment using whisperx.
If whisperx is not installed, falls back to faster-whisper at segment level (no word alignment). Both paths are packaged in the ASR Docker runtime.
- Device: gpu
- Runtime:
vkit docker run --tag asr <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
model |
str | tiny |
|
language |
str | None | None |
|
batch_size |
int | 8 |
|
compute_type |
str | int8 |
- name: my_whisperx_asr
op: whisperx_asr
args:
model: tiny
language: null
batch_size: 8
compute_type: int8
Quality & Filtering
audio_fingerprint_dedup
Remove near-duplicate cuts using MFCC mean features + simhash.
For each cut, a 13-coefficient MFCC mean vector is computed and hashed
with simhash. Cuts whose hash is within similarity_threshold bits
(hamming distance) of any previously seen hash are dropped as duplicates.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
similarity_threshold |
int | 3 |
- name: my_audio_fingerprint_dedup
op: audio_fingerprint_dedup
args:
similarity_threshold: 3
bandwidth_estimate
Estimate effective audio bandwidth and store in metrics.
Detects files that were upsampled from a lower sample rate — e.g., an
8 kHz telephone recording saved as 48 kHz WAV will show
bandwidth_khz ≈ 4.0.
Computes STFT, measures mean power per frequency bin, then finds the
frequency where energy drops sharply (ratio method). Writes:
- metrics["bandwidth_khz"]: effective bandwidth in kHz
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
nfft |
int | 512 |
|
hop |
int | 256 |
- name: my_bandwidth_estimate
op: bandwidth_estimate
args:
nfft: 512
hop: 256
cer_wer
Compute CER and WER between ASR output and reference text.
Reference text is read from cut.custom[reference_field] (default key:
"reference_text"). Cuts without a reference are passed through unchanged.
With normalize=True (default) both hypothesis and reference are
normalized before comparison:
- SenseVoice <|zh|><|HAPPY|>… tags are stripped
- Paraformer's inter-character spaces are removed
- Punctuation is discarded
- Text is lowercased
This makes CER directly comparable across different ASR backends.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
hypothesis_field |
str | text |
|
reference_field |
str | reference_text |
|
normalize |
bool | True |
- name: my_cer_wer
op: cer_wer
args:
hypothesis_field: text
reference_field: reference_text
normalize: true
clipping_detect
Detect audio clipping and store the ratio of clipped samples.
Clipping occurs when recording levels are too high, causing the waveform to be truncated at the maximum amplitude ceiling. This produces harsh distortion that degrades ASR and TTS training.
Writes metrics["clipping_ratio"] — fraction of samples whose
absolute value exceeds ceiling (default 0.99). A ratio > 0.01
indicates significant clipping.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
ceiling |
float | 0.99 |
- name: my_clipping_detect
op: clipping_detect
args:
ceiling: 0.99
dnsmos_score
Score audio quality using Microsoft DNSMOS (no reference needed).
Writes four metrics:
- dnsmos_ovrl — P.835 overall quality (1-5)
- dnsmos_sig — P.835 speech signal quality (1-5)
- dnsmos_bak — P.835 background noise quality (1-5)
- dnsmos_p808 — P.808 overall MOS (1-5)
Higher is better. Typically dnsmos_ovrl > 3.0 is considered
acceptable for training data.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
use_gpu |
bool | False |
- name: my_dnsmos_score
op: dnsmos_score
args:
use_gpu: false
duration_filter
Drop Cuts whose duration falls outside [min_duration, max_duration].
This is an N-to-fewer operator: no audio is read or written.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
min_duration |
float | 0.0 |
|
max_duration |
float | None | None |
- name: my_duration_filter
op: duration_filter
args:
min_duration: 0.0
max_duration: null
pitch_stats
Compute pitch (F0) statistics using PyWorld (dio + stonemask).
More accurate than librosa.pyin for speech. Writes:
- metrics["pitch_mean"] — mean F0 in Hz (voiced frames only)
- metrics["pitch_std"] — normalized std (0-1 range, pitch-independent)
A pitch_mean of 0 means no voiced frames were detected.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
f0_min |
float | 50.0 |
|
f0_max |
float | 2400.0 |
|
frame_period_ms |
float | 5.0 |
- name: my_pitch_stats
op: pitch_stats
args:
f0_min: 50.0
f0_max: 2400.0
frame_period_ms: 5.0
quality_score_filter
Drop Cuts that do not satisfy all conditions.
Each condition is a whitespace-separated triple field.path op value
where op is one of >, >=, <, <=, ==, !=.
All conditions are AND-ed together.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
conditions |
list[str] | required |
- name: my_quality_score_filter
op: quality_score_filter
args:
conditions: <list[str]>
snr_estimate
Estimate SNR via a peak-to-RMS ratio and store it in cut.metrics["snr"].
This is a rough proxy (not WADA-SNR or model-based) sufficient for v0.1. No audio is written; only the metrics dict is updated.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
- name: my_snr_estimate
op: snr_estimate
speaker_similarity
Score speaker similarity against a reference embedding (cosine).
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
reference_path |
str | required | |
embedding_key |
str | speaker_embedding |
- name: my_speaker_similarity
op: speaker_similarity
args:
reference_path: <str>
embedding_key: speaker_embedding
utmos_score
Predict speech naturalness MOS using UTMOS (no reference needed).
Writes metrics["utmos"] — predicted MOS score (1-5).
Higher is better. Scores > 4.0 indicate natural-sounding speech.
Useful for filtering synthetic/degraded audio from training data.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
- name: my_utmos_score
op: utmos_score
Synthesis
tts_chattts
Synthesize conversational speech using ChatTTS.
- Device: gpu
- Runtime:
vkit docker run --tag tts <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
seed |
int | None | None |
|
temperature |
float | 0.3 |
|
top_p |
float | 0.7 |
|
top_k |
int | 20 |
- name: my_tts_chattts
op: tts_chattts
args:
seed: null
temperature: 0.3
top_p: 0.7
top_k: 20
tts_cosyvoice
Synthesize speech using CosyVoice2 with optional voice cloning.
- Device: gpu
- Runtime:
vkit docker run --tag tts <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
model_id |
str | FunAudioLLM/CosyVoice2-0.5B |
|
mode |
str | sft |
|
spk_id |
str | default |
|
reference_audio |
str | None | None |
|
reference_text |
str | None | None |
- name: my_tts_cosyvoice
op: tts_cosyvoice
args:
model_id: FunAudioLLM/CosyVoice2-0.5B
mode: sft
spk_id: default
reference_audio: null
reference_text: null
tts_fish_speech
Synthesize speech using Fish-Speech codec language model.
- Device: gpu
- Runtime:
vkit docker run --tag fish-speech <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
model_id |
str | fishaudio/s2-pro |
|
reference_audio |
str | None | None |
|
reference_text |
str | None | None |
|
max_new_tokens |
int | 1024 |
|
top_p |
float | 0.8 |
|
temperature |
float | 0.8 |
|
repetition_penalty |
float | 1.1 |
|
chunk_length |
int | 200 |
|
seed |
int | None | None |
|
compile |
bool | False |
|
half |
bool | False |
- name: my_tts_fish_speech
op: tts_fish_speech
args:
model_id: fishaudio/s2-pro
reference_audio: null
reference_text: null
max_new_tokens: 1024
top_p: 0.8
temperature: 0.8
repetition_penalty: 1.1
chunk_length: 200
seed: null
compile: false
half: false
tts_kokoro
Synthesize speech from text using Kokoro TTS.
- Device: cpu
- Runtime:
vkit docker run --tag tts <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
voice |
str | af_heart |
|
lang_code |
str | a |
|
speed |
float | 1.0 |
- name: my_tts_kokoro
op: tts_kokoro
args:
voice: af_heart
lang_code: a
speed: 1.0
Output / Packing
pack_huggingface
Export CutSet as a HuggingFace Dataset with audio column.
Warning
Recent HuggingFace datasets versions may require torchcodec
when decoding Audio rows into arrays. Install torchcodec in the
training environment, or cast with Audio(decode=False) when reading
metadata/embedded audio bytes only.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
output_dir |
str | None | None |
- name: my_pack_huggingface
op: pack_huggingface
args:
output_dir: null
pack_jsonl
Write a flat JSONL manifest — one JSON object per line.
Fields: id, origin_id, start, end, duration, sample_rate, text, snr, gender (male/female/unknown), speaker, language.
start/end are the VAD segment boundaries in the original
recording. origin_id traces back to the source filename.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
output_path |
str | None | None |
- name: my_pack_jsonl
op: pack_jsonl
args:
output_path: null
pack_kaldi
Export CutSet in Kaldi format (wav.scp, text, utt2spk, spk2utt).
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
output_dir |
str | None | None |
- name: my_pack_kaldi
op: pack_kaldi
args:
output_dir: null
pack_manifest
Write a flat manifest (cuts.jsonl.gz) with no audio export.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
- name: my_pack_manifest
op: pack_manifest
pack_parquet
Export CutSet as Apache Parquet with audio file references.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
| Parameter | Type | Default | Description |
|---|---|---|---|
output_dir |
str | None | None |
- name: my_pack_parquet
op: pack_parquet
args:
output_dir: null
pack_webdataset
Export CutSet as WebDataset tar shards with embedded audio.
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: Yes
| Parameter | Type | Default | Description |
|---|---|---|---|
output_dir |
str | None | None |
|
shard_size |
int | 1000 |
- name: my_pack_webdataset
op: pack_webdataset
args:
output_dir: null
shard_size: 1000
Utility
identity
Pass cuts through unchanged (no-op, useful for testing).
- Device: cpu
- Runtime:
vkit docker run --tag slim <yaml> - Produces audio: No
- name: my_identity
op: identity