TTS Training Data
Prepare high-quality TTS training data from raw recordings. The pipeline is mostly a quality gate: it standardizes format, segments to a usable utterance length, denoises just enough, drops low-quality segments, and adds ASR text + timestamps so each kept segment is paired with a reliable label.
Looking for the inverse direction — generating speech from text? Two companion tutorials split synthesis by use case: Speaker TTS (pick a built-in voice) and Voice Cloning & TTS (clone a voice from a short reference).
Quick Start
vkit init my-tts-project --template tts
cd my-tts-project
# Put your audio files in ./data/
vkit docker run --tag asr pipeline.yaml --dry-run
vkit docker run --tag asr pipeline.yaml
What the Pipeline Does
| Stage | Operator | Why |
|---|---|---|
| Format convert | ffmpeg_convert |
Handle any input format (opus, flac, mp3, etc.) |
| Resample | resample → 22.05kHz mono |
TTS standard sample rate; mono for single-speaker |
| Denoise | speech_enhance (aggressiveness=0.3) |
Light denoising — preserve natural speech quality |
| VAD | silero_vad (min 1s) |
Split into utterances; 1s minimum for meaningful TTS segments |
| SNR | snr_estimate |
Measure signal-to-noise ratio |
| Filter | quality_score_filter |
Keep only 2–15s segments with SNR > 15 dB |
| ASR + Align | qwen3_asr (timestamps=true) |
Transcribe + word-level forced alignment in one pass |
| Pack | pack_jsonl |
Output manifest with text, timestamps, quality metrics |
Quality Checklist
Every kept segment in the output manifest satisfies all of these. Edit
the filter stage if your dataset needs different thresholds.
| Check | Default | Why this matters for TTS |
|---|---|---|
| Sample rate | 22.05 kHz mono | Matches VITS / FastSpeech2 / Tacotron2 training assumptions |
| Duration | 2 s ≤ len ≤ 15 s | < 2 s gives no prosody; > 15 s strains attention |
| SNR | > 15 dB | TTS reproduces noise; 15 dB ≈ speech 30× louder than noise |
| Text | present and non-empty | TTS training needs paired text |
| Alignment | word-level present | Enables phoneme-aligned training / inspection |
Key Design Decisions
Why 22.05 kHz?
Most TTS systems (VITS, Tacotron2, FastSpeech2) use 22.05 kHz. Using 16 kHz loses high-frequency detail that matters for speech naturalness. 44.1/48 kHz is unnecessarily large.
Why light denoising (0.3)?
Aggressive denoising removes subtle speech nuances (breath, lip sounds) that make TTS output sound natural. For TTS, clean-but-natural is better than sterile-but-artificial. If your source audio is very noisy, increase to 0.5.
Why 2–15 seconds?
- < 2s: Too short for TTS models to learn prosody patterns
- > 15s: Attention-based TTS models struggle with long sequences
- Sweet spot for most TTS architectures: 5–10 seconds
Why SNR > 15 dB?
TTS training is more sensitive to noise than ASR training. The model learns to reproduce whatever is in the audio, including noise. SNR 15 dB means the speech is ~30x louder than noise — clean enough for high-quality synthesis.
Customization
For Chinese TTS
Change the ASR language detection to Chinese for better accuracy:
- name: asr
op: qwen3_asr
args:
model: Qwen/Qwen3-ASR-0.6B
language: Chinese
return_timestamps: true
For multi-speaker TTS
Add speaker diarization before packing. This mixes ASR and diarization
operators, so run the edited pipeline with --tag latest.
- name: diarize
op: pyannote_diarize
args:
model: pyannote/speaker-diarization-3.1
Adjust quality thresholds
# Stricter (fewer but cleaner segments)
conditions:
- "duration > 3"
- "duration < 12"
- "metrics.snr > 20"
# Looser (more data, tolerate some noise)
conditions:
- "duration > 1"
- "duration < 20"
- "metrics.snr > 10"
Output Format
The final pack_jsonl stage produces a JSONL manifest where each line is a JSON object:
{
"id": "segment_001__enhanced",
"duration": 5.2,
"text": "这是一段示例语音",
"word_alignments": [
{"text": "这", "start": 0.12, "end": 0.35},
{"text": "是", "start": 0.35, "end": 0.52},
...
],
"metrics": {"snr": 22.5}
}
Next Steps
- Train a TTS model on the resulting manifest using your framework of choice (Tacotron2, VITS, FastSpeech2, Coqui XTTS, etc.).
- Or feed the manifest to a pretrained engine to A/B compare your data against existing voices or to generate new audio from the transcripts you just extracted:
- Speaker TTS — kokoro, ChatTTS, CosyVoice with a built-in speaker id.
- Voice Cloning & TTS — CosyVoice zero-shot, Fish-Speech: synthesize in a voice you supply via reference audio.