Speaker TTS
Synthesize text in a built-in voice — a named voice that ships with the engine, or a seed-sampled timbre you can reproduce by fixing the seed. No reference audio required.
Need to clone a specific voice from a short recording instead? See Voice Cloning & TTS. Preparing audio to train your own TTS model? See TTS Training Data.
Quick Start (Python)
The fastest path is voxkitchen.tools.synthesize from inside a Docker
shell — one line per utterance, no YAML.
from voxkitchen.tools import synthesize
# Built-in English voice, runs on CPU
synthesize("Hello world!", "out.wav", engine="kokoro")
# Built-in Mandarin voice
synthesize("你好,世界。", "zh.wav", engine="kokoro", language="z")
Run from inside an image that has the right engine installed:
vkit docker shell --tag tts
# inside the container:
python -c "from voxkitchen.tools import synthesize; synthesize('Hello!', '/app/output/hello.wav', engine='kokoro')"
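For more than a couple of utterances, the one-line call can be wrapped in a loop. The sketch below is an illustration, not part of VoxKitchen: `synthesize_all` and the `utt_NNNN.wav` naming are hypothetical, and the synthesis function is passed in as an argument so the helper assumes only the `(text, path, engine=...)` call shape used in the quick start.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def synthesize_all(
    texts: Iterable[str],
    out_dir: str,
    synth: Callable[..., object],
    engine: str = "kokoro",
) -> List[Path]:
    """Call synth(text, path, engine=...) once per non-empty text.

    Hypothetical helper: `synth` is expected to behave like
    voxkitchen.tools.synthesize as called in the quick start.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, text in enumerate(texts):
        text = text.strip()
        if not text:
            continue  # skip blank lines rather than synthesizing nothing
        path = out / f"utt_{index:04d}.wav"
        synth(text, str(path), engine=engine)
        written.append(path)
    return written
```

Inside the container this would be called as `synthesize_all(lines, "/app/output", synthesize)` after `from voxkitchen.tools import synthesize`.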
Quick Start (YAML)
Batch synthesis runs through the standard pipeline: ingest a manifest
of text-only cuts, then a tts_<engine> stage, then pack the results.
# yaml-language-server: $schema=https://raw.githubusercontent.com/XqFeng-Josie/VoxKitchen/main/docs/schemas/pipeline.schema.json
version: "0.1"
name: tts-speaker
work_dir: ./work/${name}
ingest:
  source: manifest
  args:
    path: ./text_scripts.jsonl.gz
stages:
  - name: synthesize
    op: tts_kokoro
    args:
      voice: af_heart
      lang_code: a
      speed: 1.0
  - name: pack
    op: pack_jsonl
The input manifest is a regular VoxKitchen cuts.jsonl.gz whose
supervisions carry text (no audio needed). Each text supervision is
synthesized into one new WAV file written under
<work_dir>/<stage>/derived/<cut-id>__<engine>.wav, and the output
manifest references those files.
A ready-to-use example lives at
examples/pipelines/tts-synthesis.yaml.
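The derived-file naming convention described above can be expressed as a pair of small helpers. This is a sketch under assumptions: both function names are hypothetical, and the stage directory name and engine suffix are taken as given rather than derived from the pipeline config.

```python
from pathlib import Path
from typing import Tuple


def derived_wav_path(work_dir: str, stage_dir: str, cut_id: str, engine: str) -> Path:
    """Build <work_dir>/<stage>/derived/<cut-id>__<engine>.wav."""
    return Path(work_dir) / stage_dir / "derived" / f"{cut_id}__{engine}.wav"


def split_derived_name(path: str) -> Tuple[str, str]:
    """Recover (cut_id, engine) from a derived WAV file name.

    Splits on the LAST double underscore, so cut ids that themselves
    contain "__" are still handled.
    """
    cut_id, sep, engine = Path(path).stem.rpartition("__")
    if not sep:
        raise ValueError(f"not a derived TTS file name: {path!r}")
    return cut_id, engine
```

For example, `split_derived_name("cut-001__kokoro.wav")` returns `("cut-001", "kokoro")`.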
Engine Capabilities
| Engine | Voice source | Language scope | Device | Output SR | Image |
|---|---|---|---|---|---|
| tts_kokoro | named voices (e.g. af_heart) | 8 codes: a AmE, b BrE, j JA, z ZH, f FR, h HI, i IT, p BR-PT | CPU | 24 kHz | tts |
| tts_chattts | seed-sampled timbre (set seed to reproduce) | Chinese & English | GPU | 24 kHz | tts |
| tts_cosyvoice (sft mode) | built-in spk_id (no reference audio) | Multilingual (zh, en, ja, ko, …) | GPU | 24 kHz | tts |
All three engines share the tts image. Mixing them with ASR or
diarization stages in a single pipeline requires the latest image.
Kokoro — CPU, 8 languages
- name: synthesize
  op: tts_kokoro
  args:
    voice: af_heart   # English female; see Kokoro voice list
    lang_code: a      # a=AmE, b=BrE, j=Japanese, z=Mandarin, …
    speed: 1.0        # 0.8 = slower, 1.2 = faster
Kokoro is the only engine in this group that runs comfortably on CPU, and it covers eight language codes. Use it as the default when no GPU is available or when you need multilingual coverage.
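Because `lang_code` is a single letter, a typo is easy to miss until the stage runs. The following sketch checks a code against the eight values listed in the capabilities table before a config is written. The dictionary and `check_lang_code` are hypothetical helpers, not part of VoxKitchen.

```python
# The eight codes from the Engine Capabilities table.
KOKORO_LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin Chinese",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "p": "Brazilian Portuguese",
}


def check_lang_code(code: str) -> str:
    """Return the language name for a code, or raise ValueError."""
    try:
        return KOKORO_LANG_CODES[code]
    except KeyError:
        valid = ", ".join(sorted(KOKORO_LANG_CODES))
        raise ValueError(
            f"unknown lang_code {code!r}; expected one of: {valid}"
        ) from None
```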
CosyVoice — sft mode, built-in speakers
- name: synthesize
  op: tts_cosyvoice
  args:
    mode: sft
    spk_id: default   # or another built-in speaker id
sft skips the reference-audio path entirely, so it's the fastest of
CosyVoice's three modes. For the reference-driven modes (zero_shot,
cross_lingual), see Voice Cloning & TTS.
ChatTTS — conversational style, seed-driven
- name: synthesize
  op: tts_chattts
  args:
    seed: 42          # fix to keep speaker timbre stable across calls
    temperature: 0.3
seed controls the sampled timbre. Leave it null to get a different
"voice" on each call. Prosody tags such as [laugh] and [uv_break] can be
embedded directly in the text.
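Since the tags are plain text, they can be added while the script is being assembled. A minimal sketch, assuming only that tags are written inline as shown above; `join_with_breaks` is a hypothetical helper, and how the engine renders each tag is not covered here.

```python
from typing import Iterable


def join_with_breaks(phrases: Iterable[str], tag: str = "[uv_break]") -> str:
    """Join non-empty phrases with an inline prosody tag between them."""
    cleaned = [p.strip() for p in phrases if p.strip()]
    return f" {tag} ".join(cleaned)
```

For example, `join_with_breaks(["Well", "that was unexpected [laugh]"])` returns `"Well [uv_break] that was unexpected [laugh]"`.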
Choosing An Engine
- Need to run without a GPU? → tts_kokoro.
- Multilingual coverage matters most? → tts_kokoro (8 languages including ZH / JA).
- Mandarin or English read-aloud with a "natural" feel? → tts_chattts; pin seed for stability.
- Want a fixed named speaker on GPU? → tts_cosyvoice in sft mode.
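The checklist above can be written as a small decision function. This is one possible reading of it: `choose_engine` is hypothetical, and the order in which the questions are asked (GPU first, then coverage, then speaker) is an assumption made for the sketch.

```python
def choose_engine(
    has_gpu: bool,
    multilingual: bool = False,
    fixed_speaker: bool = False,
) -> str:
    """Map the checklist to an operator name.

    Assumed priority: no GPU or broad language coverage wins first,
    then a fixed named speaker, otherwise conversational read-aloud.
    """
    if not has_gpu or multilingual:
        return "tts_kokoro"
    if fixed_speaker:
        return "tts_cosyvoice"  # use mode: sft
    return "tts_chattts"  # Mandarin or English; pin seed for stability
```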
When in doubt, run vkit validate pipeline.yaml, which prints the smallest Docker tag that can run all stages in your pipeline.
Preparing The Input Manifest
The synthesis pipeline reads a cuts.jsonl.gz whose cuts already carry
text. Two common ways to produce one:
- Write Python that builds Cut records with Supervision(text=...) and writes them to disk with to_jsonl_gz. See Python Tools API for Cut / CutSet.
- Reuse the output of an ASR pipeline: every cut from an ASR run already has text. Point this synthesis pipeline at that manifest to resynthesize the same utterances with a different voice.
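To show what a gzipped JSONL manifest looks like on disk, here is a standard-library sketch. The record layout (`id`, `supervisions`, `text`) is an assumption made for illustration and `write_text_manifest` is a hypothetical helper. The authoritative schema is whatever VoxKitchen's Cut and Supervision classes serialize, so prefer those for real pipelines.

```python
import gzip
import json
from typing import Iterable


def write_text_manifest(texts: Iterable[str], path: str, prefix: str = "utt") -> int:
    """Write one JSON object per line to a gzipped JSONL file.

    ASSUMPTION: the field names below are illustrative only; the real
    schema is defined by VoxKitchen's Cut / Supervision classes.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for index, text in enumerate(texts):
            record = {
                "id": f"{prefix}-{index:05d}",
                "supervisions": [{"text": text}],
            }
            # ensure_ascii=False keeps non-Latin text readable in the file
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```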
Inspecting Outputs
vkit inspect run work/tts-speaker/
vkit inspect cuts work/tts-speaker/01_pack/cuts.jsonl.gz
The synthesized WAV files live under each TTS stage's derived/
directory. The final manifest's recordings point at them.
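For a quick sanity check outside the CLI, the derived files can be scanned with the standard library. This sketch assumes the `<work_dir>/<stage>/derived/` layout described earlier and that the files are PCM WAV, which is the only kind Python's `wave` module reads. `summarize_derived_wavs` is a hypothetical helper.

```python
import wave
from pathlib import Path
from typing import Dict


def summarize_derived_wavs(work_dir: str) -> Dict[str, dict]:
    """Report sample rate and duration for every <stage>/derived/*.wav."""
    root = Path(work_dir)
    summary: Dict[str, dict] = {}
    for wav_path in sorted(root.glob("*/derived/*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
        # Key by path relative to the work dir so stages do not collide.
        key = wav_path.relative_to(root).as_posix()
        summary[key] = {
            "sample_rate": rate,
            "seconds": frames / rate if rate else 0.0,
        }
    return summary
```

A sample rate other than 24000 in the result would be worth investigating, since the capabilities table lists 24 kHz output for all three engines.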
Related
- Voice Cloning & TTS: clone a voice from a short reference instead of picking a built-in one.
- TTS Training Data: quality gate for raw speech you want to use to train your own TTS model.
- Python Tools API, TTS Synthesis: full signature of synthesize().
- Operators reference: every TTS operator's full config schema.