ASR Training Data
Prepare augmented ASR training data with automatic transcription.
Quick Start
vkit init my-asr-project --template asr
cd my-asr-project
# Put your audio files in ./data/
vkit docker run --tag asr pipeline.yaml --dry-run
vkit docker run --tag asr pipeline.yaml
What the Pipeline Does
| Stage | Operator | Why |
|---|---|---|
| Resample | resample → 16 kHz mono | ASR standard; most pretrained models expect 16 kHz |
| VAD | silero_vad | Split long recordings into utterance-level segments |
| Speed perturb | speed_perturb [0.9, 1.0, 1.1] | 3x data augmentation; improves ASR robustness |
| Volume perturb | volume_perturb [-3, +3] dB | Simulates varying recording conditions |
| ASR | faster_whisper_asr large-v3 | Generate text labels for training |
| Filter | quality_score_filter | Remove too-short/too-long segments |
| Pack | pack_huggingface | Output as HuggingFace Dataset (ready for training) |
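To make the volume perturbation stage concrete, here is a minimal sketch of how a gain in decibels maps to a linear amplitude factor. It is an illustration only, not vkit's implementation: the helper names `db_to_gain` and `apply_random_gain` are hypothetical, samples are assumed to be floats in [-1.0, 1.0], and drawing the gain uniformly from the range is an assumption.

```python
import random


def db_to_gain(db: float) -> float:
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)


def apply_random_gain(samples, low_db=-3.0, high_db=3.0, rng=None):
    """Scale samples by a gain drawn uniformly from [low_db, high_db] dB.

    Hypothetical helper: clips to [-1.0, 1.0] so boosted audio stays in range.
    """
    rng = rng or random.Random()
    gain = db_to_gain(rng.uniform(low_db, high_db))
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

A gain of +3 dB multiplies the amplitude by roughly 1.41 and -3 dB by roughly 0.71, so the [-3, +3] dB range changes loudness noticeably without distorting typical speech levels.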
Key Design Decisions
Why speed perturbation?
Speed perturbation at factors [0.9, 1.0, 1.1] is one of the most effective and widely used data augmentations for ASR. It simulates different speaking rates and slightly shifts pitch, making the model more robust to natural variation. Because each utterance yields three versions (the original at 1.0 plus two perturbed copies), this triples your training data.
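The effect of the three factors can be sketched with a naive resampler. This is a simplified illustration, not the speed_perturb operator itself: the function below is hypothetical and uses linear interpolation, whereas production tools typically use band-limited resampling.

```python
def speed_perturb(samples, factor):
    """Resample so playback at the original rate is `factor` times faster.

    Uses linear interpolation; output length is about len(samples) / factor.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor  # fractional index into the input
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# One second of a 16 kHz ramp signal, perturbed at the three template factors.
original = [i / 16000 for i in range(16000)]
variants = {f: speed_perturb(original, f) for f in (0.9, 1.0, 1.1)}
```

Factor 0.9 stretches the clip (more samples, slower and lower-pitched speech), factor 1.1 shortens it, and factor 1.0 returns the input unchanged, which is why the output contains three versions of every utterance.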
Why not noise augmentation?
Noise augmentation (noise_augment) is also effective, but it requires a noise dataset (e.g., MUSAN). The default template keeps things simple and has no external data dependencies. To add noise augmentation:
# Add after volume_aug, requires noise files in ./data/noise/
- name: noise_aug
  op: noise_augment
  args:
    noise_dir: ./data/noise
    snr_range: [5, 20]
Put MUSAN or another noise dataset under ./data/noise/ before enabling
this stage. The MUSAN recipe is tracked in the roadmap, but is not a built-in
recipe yet.
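For intuition about what snr_range controls, the following sketch mixes a noise clip into speech at a target signal-to-noise ratio. It is not the noise_augment implementation: the function names are hypothetical, samples are assumed to be plain lists of floats, and picking the SNR uniformly from the range is an assumption.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Tile or truncate the noise clip to the speech length.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * reps)[: len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        return list(speech)  # silent noise clip: nothing to mix
    # Solve p_speech / (scale**2 * p_noise) = 10 ** (snr_db / 10) for scale.
    scale = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]


def random_snr_mix(speech, noise, snr_range=(5.0, 20.0), rng=None):
    """Pick an SNR uniformly from snr_range (dB) and mix."""
    rng = rng or random.Random()
    return mix_at_snr(speech, noise, rng.uniform(*snr_range))
```

Lower SNR values mean louder noise relative to speech, so a range of [5, 20] dB spans from clearly noisy to nearly clean conditions.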
Why HuggingFace output?
pack_huggingface produces a dataset loadable with datasets.load_from_disk(), so it plugs into training toolkits that consume HuggingFace datasets (Transformers, SpeechBrain, ESPnet).
By default, the template writes the final dataset to ./output/hf_dataset.
The audio column is embedded in the HuggingFace Dataset's Arrow shard, so the
final result is not a directory of standalone WAV files. Use
--keep-intermediates or gc_mode: keep when you also need to preserve each
stage's derived WAV files under ./work.
Load the exported dataset with:
from datasets import load_from_disk
ds = load_from_disk("./output/hf_dataset")
Recent HuggingFace datasets versions decode Audio columns through
torchcodec. If your training code needs audio["array"], install it in the
training environment:
pip install torchcodec
For metadata checks or custom audio decoding, avoid automatic decode:
from datasets import Audio, load_from_disk
ds = load_from_disk("./output/hf_dataset")
ds = ds.cast_column("audio", Audio(decode=False))
row = ds[0]
audio_bytes = row["audio"]["bytes"]
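Once you have the raw bytes, a metadata check can be done with the standard library alone. The sketch below assumes the embedded audio is PCM WAV, which this document does not confirm; `wav_info` is a hypothetical helper, and Python's wave module raises wave.Error for encodings it does not support.

```python
import io
import wave


def wav_info(audio_bytes: bytes) -> dict:
    """Read header fields from WAV-encoded bytes without decoding via datasets."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }
```

Calling `wav_info(row["audio"]["bytes"])` on a few rows is a quick way to confirm that the exported audio is 16 kHz mono before starting a training run.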
Customization
For Chinese ASR
Replace faster_whisper_asr with qwen3_asr or paraformer_asr (keep only one asr stage):
- name: asr
  op: qwen3_asr
  args:
    model: Qwen/Qwen3-ASR-0.6B
    language: Chinese
# Or use Paraformer (optimized for Chinese)
- name: asr
  op: paraformer_asr
  args:
    model: iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Start from a known dataset
Replace the dir ingest source with a recipe:
ingest:
  source: recipe
  recipe: librispeech
  args:
    root: ./data/librispeech
    subsets: [train-clean-100]
Kaldi output format
Replace pack_huggingface with:
- name: pack
  op: pack_kaldi
This produces the Kaldi data files wav.scp, text, utt2spk, and spk2utt.
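To show how two of these files relate, here is a small sketch that derives spk2utt lines from utt2spk lines. It follows the standard Kaldi layout (one "<utt-id> <spk-id>" pair per line in utt2spk) and is not taken from pack_kaldi; the function name is hypothetical.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert Kaldi utt2spk lines ("<utt-id> <spk-id>") into spk2utt lines."""
    by_speaker = defaultdict(list)
    for line in utt2spk_lines:
        line = line.strip()
        if not line:
            continue
        utt_id, spk_id = line.split(maxsplit=1)
        by_speaker[spk_id].append(utt_id)
    # Kaldi expects sorted files; sort speakers and each speaker's utterances.
    return [
        f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(by_speaker.items())
    ]
```

Python's default string sort matches Kaldi's expected byte order for ASCII identifiers, which is worth keeping in mind if your utterance IDs contain other characters.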