Data Cleaning

Clean up raw audio data: measure quality, remove duplicates, and filter out bad files.

Quick Start

vkit init my-cleaning-project --template cleaning
cd my-cleaning-project
# Put your raw audio files in ./data/
vkit docker run --tag slim pipeline.yaml --dry-run
vkit docker run --tag slim pipeline.yaml
vkit inspect cuts work/*/05_filter/cuts.jsonl.gz
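
If you would rather inspect the output manifest from a script, the sketch below reads a gzip-compressed JSONL file and averages one metric. It assumes one JSON object per line with metrics stored under a `metrics` key (suggested by the filter conditions later on this page); `read_manifest` and `summarize` are hypothetical helpers, not part of vkit, and the real schema may differ.

```python
import gzip
import json
from pathlib import Path


def read_manifest(path):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def summarize(records, metric="snr"):
    """Return (count, mean) for one metric, ignoring records that lack it.

    Assumption: metrics live under a "metrics" key in each record.
    """
    values = [
        r["metrics"][metric]
        for r in records
        if isinstance(r.get("metrics"), dict) and metric in r["metrics"]
    ]
    mean = sum(values) / len(values) if values else None
    return len(values), mean


if __name__ == "__main__":
    # The glob mirrors the path passed to `vkit inspect cuts` above.
    for manifest in sorted(Path("work").glob("*/05_filter/cuts.jsonl.gz")):
        count, mean = summarize(read_manifest(manifest))
        print(manifest, count, mean)
```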

What the Pipeline Does

| Stage | Operator | What it measures/does |
| --- | --- | --- |
| Resample | resample → 16 kHz | Normalize all audio to the same format |
| SNR | snr_estimate | Signal-to-noise ratio (dB) |
| Clipping | clipping_detect | Ratio of clipped samples (0.0 = no clipping) |
| Bandwidth | bandwidth_estimate | Effective bandwidth in kHz (detects upsampled audio) |
| Dedup | audio_fingerprint_dedup | Remove near-duplicate audio via MFCC + SimHash |
| Filter | quality_score_filter | Drop files that fail quality thresholds |
| Pack | pack_jsonl | Output manifest with all quality metrics |

Understanding the Metrics

SNR (Signal-to-Noise Ratio)

| SNR | Quality | Typical source |
| --- | --- | --- |
| < 5 dB | Unusable | Street recording, heavy background noise |
| 5–15 dB | Noisy | Meeting room, casual recording |
| 15–25 dB | Clean | Studio with some ambient noise |
| > 25 dB | Very clean | Professional studio |
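
To make the dB scale concrete, here is a minimal sketch of the textbook definition, SNR = 10·log10(signal power / noise power), plus a lookup into the bands above. It assumes the clean signal and the noise are available separately, which is not the case for a real recording; how `snr_estimate` separates them is not described here. `snr_db` and `snr_label` are hypothetical names, and the handling of values that fall exactly on a band edge is a choice made for this example.

```python
import math


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """SNR in dB from a separate signal and noise reference."""
    return 10.0 * math.log10(power(signal) / power(noise))


def snr_label(db):
    """Map an SNR value onto the quality bands in the table above.

    Edge values (5, 15, 25) are assigned by this sketch's own convention.
    """
    if db < 5:
        return "Unusable"
    if db < 15:
        return "Noisy"
    if db <= 25:
        return "Clean"
    return "Very clean"


if __name__ == "__main__":
    signal = [1.0, -1.0] * 100   # power 1.0
    noise = [0.1, -0.1] * 100    # power 0.01, i.e. 100x weaker
    value = snr_db(signal, noise)
    print(round(value, 2), snr_label(value))
```

A 100-fold power ratio corresponds to 20 dB, which lands in the "Clean" band.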

Clipping Ratio

| Ratio | Meaning |
| --- | --- |
| 0.0 | No clipping (good) |
| < 0.01 | Minimal clipping (acceptable) |
| > 0.01 | Significant clipping (likely audible distortion) |
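
A clipping ratio can be sketched as the fraction of samples sitting at or near full scale. The example below assumes floating-point samples in the range -1.0 to 1.0 and a small tolerance below full scale; both are assumptions for illustration, and `clipping_detect` may use a different criterion (for example, runs of consecutive saturated samples). The function names are hypothetical.

```python
def clipping_ratio(samples, full_scale=1.0, tolerance=1e-3):
    """Fraction of samples whose magnitude reaches (nearly) full scale."""
    if not samples:
        return 0.0
    limit = full_scale * (1.0 - tolerance)
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def clipping_label(ratio):
    """Map a ratio onto the table above.

    A ratio of exactly 0.01 is treated as significant in this sketch.
    """
    if ratio == 0.0:
        return "No clipping"
    if ratio < 0.01:
        return "Minimal clipping"
    return "Significant clipping"


if __name__ == "__main__":
    samples = [0.5] * 98 + [1.0, -1.0]  # 2 of 100 samples at full scale
    ratio = clipping_ratio(samples)
    print(ratio, clipping_label(ratio))
```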

Bandwidth

| Bandwidth (kHz) | Meaning |
| --- | --- |
| < 4 | Telephone quality (equivalent to an 8 kHz sample rate) |
| 4–7 | Wideband telephony |
| > 7 | Full-band: genuine high-quality recording |

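A file saved as a 16 kHz WAV but with a measured bandwidth of 3.5 kHz was most likely upsampled from 8 kHz: an 8 kHz recording cannot contain content above its 4 kHz Nyquist limit, so the extra samples carry no real information.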
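
One way to see how upsampled audio shows up is to measure where the spectral energy stops. The sketch below uses a spectral roll-off (the frequency below which 99% of the energy lies) as a stand-in for effective bandwidth; this is an assumed technique for illustration, not necessarily what `bandwidth_estimate` does. It uses a naive O(n²) DFT from the standard library, so it only suits short test signals, and `effective_bandwidth_khz` and `tone_mix` are hypothetical helpers.

```python
import cmath
import math


def effective_bandwidth_khz(samples, sample_rate, energy_fraction=0.99):
    """Frequency (kHz) below which `energy_fraction` of the energy lies.

    Naive O(n^2) DFT: keep `samples` to a few hundred points.
    """
    n = len(samples)
    energies = []
    for k in range(n // 2 + 1):  # bins from DC up to Nyquist
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        energies.append(abs(coeff) ** 2)
    total = sum(energies)
    if total == 0.0:
        return 0.0
    running = 0.0
    for k, energy in enumerate(energies):
        running += energy
        if running >= energy_fraction * total:
            return k * sample_rate / n / 1000.0
    return sample_rate / 2 / 1000.0


def tone_mix(freqs_hz, sample_rate=16000, n=512):
    """Sum of equal-amplitude sine tones, used as a synthetic test signal."""
    return [
        sum(math.sin(2 * math.pi * f * i / sample_rate) for f in freqs_hz)
        for i in range(n)
    ]


if __name__ == "__main__":
    narrow = tone_mix([1000, 3000])  # nothing above 3 kHz
    wide = tone_mix([1000, 7500])    # content close to the 8 kHz Nyquist limit
    print(effective_bandwidth_khz(narrow, 16000))
    print(effective_bandwidth_khz(wide, 16000))
```

Both signals are stored at 16 kHz, but only the second uses the upper part of the band; the first would fall in the "< 4" row of the table.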

Customization

Add DNSMOS quality scoring

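For a perceptual quality score on the 1–5 MOS scale, add a dnsmos_score stage ahead of the filter and reference its output in the filter conditions: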

  - name: dnsmos
    op: dnsmos_score

  - name: filter
    op: quality_score_filter
    args:
      conditions:
        - "metrics.snr > 10"
        - "metrics.dnsmos_ovrl > 3.0"   # perceptual quality > 3.0/5.0

Keep only audio within a duration range

  - name: filter
    op: quality_score_filter
    args:
      conditions:
        - "duration > 10"     # minimum 10 seconds
        - "duration < 300"    # maximum 5 minutes
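
To reason about which files a set of conditions would keep, the following sketch evaluates strings of the form `<field> <op> <number>` against a single manifest record. It assumes all conditions must hold for a file to be kept and that dotted names such as `metrics.snr` refer to nested keys; the expression grammar that `quality_score_filter` actually accepts is not specified on this page, so treat `passes` and `lookup` as hypothetical helpers.

```python
import operator

# Comparison operators supported by this sketch (an assumed subset).
_OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def lookup(record, dotted_path):
    """Resolve a dotted path such as "metrics.snr" inside nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def passes(record, conditions):
    """True only if every "<field> <op> <number>" condition holds."""
    for condition in conditions:
        field, op, threshold = condition.split()
        if not _OPS[op](lookup(record, field), float(threshold)):
            return False
    return True


if __name__ == "__main__":
    record = {"duration": 42.0, "metrics": {"snr": 18.5, "dnsmos_ovrl": 3.4}}
    print(passes(record, ["metrics.snr > 10", "metrics.dnsmos_ovrl > 3.0"]))
    print(passes(record, ["duration > 10", "duration < 300"]))
```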

Stricter deduplication

Lower threshold = stricter (fewer false negatives, more false positives):

  - name: dedup
    op: audio_fingerprint_dedup
    args:
      similarity_threshold: 2   # default is 3
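
For intuition about fingerprint comparison, here is a minimal SimHash sketch: each bit of the fingerprint is the sign of the feature vector's projection onto a random hyperplane, and two fingerprints are compared by Hamming distance. The feature vector stands in for MFCC-derived features, and the 64-bit width, the seed, and the function names are assumptions; how `similarity_threshold` maps onto this distance inside `audio_fingerprint_dedup` is not confirmed here.

```python
import random


def simhash(features, n_bits=64, seed=0):
    """Fingerprint a feature vector as an `n_bits`-bit integer.

    The fixed seed regenerates the same hyperplanes on every call, so
    fingerprints of different vectors are comparable.
    """
    rng = random.Random(seed)
    fingerprint = 0
    for _ in range(n_bits):
        plane = [rng.gauss(0.0, 1.0) for _ in features]
        projection = sum(p * f for p, f in zip(plane, features))
        fingerprint = (fingerprint << 1) | (1 if projection > 0 else 0)
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    base = [0.3, -1.2, 0.8, 2.0, -0.5]
    louder = [2 * x for x in base]   # same direction, larger magnitude
    inverted = [-x for x in base]    # opposite direction
    print(hamming_distance(simhash(base), simhash(louder)))
    print(hamming_distance(simhash(base), simhash(inverted)))
```

Because only the sign of each projection is kept, rescaling a vector leaves its fingerprint unchanged, while pointing it in the opposite direction flips every bit.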