Speaker Analysis

Analyze speaker distribution in audio data: identify speakers, extract embeddings, classify gender and language.

Quick Start

vkit init my-speaker-project --template speaker
cd my-speaker-project

# Set your Hugging Face token (required for pyannote diarization)
echo "HF_TOKEN=hf_your_token" > .env

# Put your audio files in ./data/, then do a dry run before the real run
vkit docker run --tag latest pipeline.yaml --dry-run
vkit docker run --tag latest pipeline.yaml

What the Pipeline Does

| Stage | Operator | Output |
|---|---|---|
| Resample | `resample` → 16 kHz | Normalized audio |
| VAD | `silero_vad` | Speech segments |
| Diarize | `pyannote_diarize` | Speaker labels (spk_0, spk_1, ...) |
| Embed | `speaker_embed` (SpeechBrain) | Speaker embedding vector per segment |
| Gender | `gender_classify` (F0-based) | "m" / "f" / "o" per segment |
| Language | `whisper_langid` | Detected language per segment |
| Pack | `pack_jsonl` | Manifest with all annotations |

Use Cases

Speaker counting

After running the pipeline, count unique speakers:

from voxkitchen.schema.cutset import CutSet

cuts = CutSet.from_jsonl_gz("work/.../06_pack/cuts.jsonl.gz")
speakers = set()
for cut in cuts:
    for sup in cut.supervisions:
        if sup.speaker:
            speakers.add(sup.speaker)
print(f"Found {len(speakers)} speakers")
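Beyond a raw count, it is often useful to see how much each speaker talks. The sketch below shows one way to total speaking time per speaker. It uses a stand-in `Segment` dataclass instead of the real supervision objects so that it runs on its own; the `duration` field and the `talk_time_by_speaker` helper are assumptions for illustration and may not match VoxKitchen's schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """Hypothetical stand-in for a supervision; field names are assumptions."""
    speaker: Optional[str]
    duration: float  # seconds


def talk_time_by_speaker(segments):
    """Sum segment durations per speaker, skipping unlabeled segments."""
    totals = defaultdict(float)
    for seg in segments:
        if seg.speaker:
            totals[seg.speaker] += seg.duration
    return dict(totals)


segments = [
    Segment("spk_0", 2.5),
    Segment("spk_1", 1.0),
    Segment("spk_0", 1.5),
    Segment(None, 0.5),  # no speaker label, so it is ignored
]
totals = talk_time_by_speaker(segments)

# Print speakers from most to least talkative
for speaker, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{speaker}: {seconds:.1f}s")
```

With real data, the same loop shape as the speaker count above should apply, provided the supervisions expose a duration.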

Speaker similarity (using embeddings)

Compare two speakers using cosine similarity:

import numpy as np

# Get embeddings from two cuts (cut1 and cut2 are any two cuts taken from `cuts`)
emb1 = np.array(cut1.custom["speaker_embedding"])
emb2 = np.array(cut2.custom["speaker_embedding"])

# Cosine similarity ranges from -1.0 to 1.0; higher means more similar voices
similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
print(f"Similarity: {similarity:.3f}")
# Rule-of-thumb thresholds; tune them on your own data:
# > 0.65 → likely same speaker
# < 0.40 → likely different speakers
# in between → inconclusive
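A single segment can give a noisy embedding, so one common approach is to average several segments of the same speaker into a centroid and compare centroids instead. The following is a minimal sketch of that idea using only the standard library and toy 3-dimensional vectors; real speaker embeddings have many more dimensions. The `cosine_similarity` and `centroid` helpers are illustrative names, not part of VoxKitchen.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ranges from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy embeddings for two speakers (made-up values for illustration)
spk_a = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
spk_b = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]

# A segment compared against its own speaker's centroid
sim_same = cosine_similarity(centroid(spk_a), spk_a[0])
# Centroids of two different speakers
sim_diff = cosine_similarity(centroid(spk_a), centroid(spk_b))

print(f"same speaker: {sim_same:.3f}")
print(f"different speakers: {sim_diff:.3f}")
```

The explicit zero-vector check avoids a division by zero, which the one-line NumPy formula would otherwise turn into `nan`.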

Gender distribution

from collections import Counter

genders = Counter()
for cut in cuts:
    for sup in cut.supervisions:
        if sup.gender:
            genders[sup.gender] += 1
print(genders)  # Counter({'m': 150, 'f': 120, 'o': 5})
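Raw counts are hard to compare across datasets of different sizes, so it can help to convert them to percentages. The sketch below does this for a `Counter` like the one printed above; the `to_percentages` helper is a hypothetical name, and the counts are simply the sample values from the comment.

```python
from collections import Counter


def to_percentages(counts):
    """Convert a Counter of labels into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Sample counts copied from the example output
genders = Counter({"m": 150, "f": 120, "o": 5})
shares = to_percentages(genders)

# Print labels from most to least frequent
for label, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {pct:.1f}%")
```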

Customization

Without diarization (simpler, no HF_TOKEN needed)

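Remove the diarize stage if each audio file already contains a single speaker. The example below also omits the gender and language stages; add them back if you need those annotations: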

stages:
  - name: resample
    op: resample
    args: { target_sr: 16000, target_channels: 1 }
  - name: vad
    op: silero_vad
    args: { threshold: 0.5 }
  - name: embed
    op: speaker_embed
    args: { method: speechbrain }
  - name: pack
    op: pack_jsonl

Using a specific SpeechBrain model

  - name: embed
    op: speaker_embed
    args:
      method: speechbrain
      speechbrain_model: speechbrain/spkrec-ecapa-voxceleb

The WeSpeaker backend is kept for custom environments, but official VoxKitchen Docker images use SpeechBrain for this operator.

Add emotion recognition

  - name: emotion
    op: emotion_recognize
    args:
      model: iic/emotion2vec_plus_large