Another lever we added was keeping the last few call centroids and biasing the spectral solver toward the prototype that had >0.75 similarity, which keeps returning participants from spawning a new SPEAKER label every session. Are you thinking about exposing that kind of anchor_embeddings hook so teams can keep participant IDs consistent across calls?
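The anchor-matching idea described above can be sketched as a small post-processing step. This is an illustrative implementation, not the commenter's actual code: `relabel_with_anchors` and its signature are assumptions, and only the 0.75 cosine-similarity threshold comes from the comment.

```python
import numpy as np

def relabel_with_anchors(centroids, anchors, threshold=0.75):
    """Map fresh cluster centroids onto stored speaker prototypes.

    centroids: {label: embedding vector} from the current call
    anchors:   {speaker_id: embedding vector} kept from earlier calls
    A label whose best cosine similarity to a prototype exceeds
    `threshold` inherits that speaker_id; otherwise it keeps its
    fresh per-call label (SPEAKER_00, SPEAKER_01, ...).
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    mapping = {}
    for label, c in centroids.items():
        best_id, best_sim = None, threshold
        for sid, proto in anchors.items():
            sim = cos(c, proto)
            if sim > best_sim:
                best_id, best_sim = sid, sim
        mapping[label] = best_id if best_id is not None else label
    return mapping
```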
On cross-session speaker consistency: yes, that's on the roadmap. The plan is to store speaker embeddings (256-dim vectors) in a vector DB and use them for matching during diarization.
Something like an anchor_embeddings parameter you can pass in, so the output labels stay consistent across calls.
Right now every call produces SPEAKER_00, SPEAKER_01, etc. independently. The embedding extraction already works well enough for matching (that's what cosine similarity on WeSpeaker embeddings is good at); the missing piece is the API surface and the matching logic on top of clustering.
What's your setup for storing/matching the centroids? Curious if you're doing it at inference time or as a post-processing step.
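One way to answer the storing/matching question: keep a running prototype per speaker and refresh it after each call. This is a toy sketch, not part of the `diarize` library; the `SpeakerStore` class and the exponential-moving-average update are my assumptions about how such a store could work.

```python
import numpy as np

class SpeakerStore:
    """Toy cross-call centroid store (illustrative only).

    Keeps one unit-norm prototype per speaker and blends in each new
    call's centroid with an exponential moving average, so gradual
    drift in a speaker's embeddings is absorbed over time.
    """
    def __init__(self, alpha=0.2):
        self.protos = {}   # speaker_id -> unit-norm prototype vector
        self.alpha = alpha  # weight given to the newest centroid

    def update(self, speaker_id, centroid):
        v = centroid / np.linalg.norm(centroid)
        if speaker_id in self.protos:
            mixed = (1 - self.alpha) * self.protos[speaker_id] + self.alpha * v
            self.protos[speaker_id] = mixed / np.linalg.norm(mixed)
        else:
            self.protos[speaker_id] = v
```

Doing this as a post-processing step (rather than inside clustering) keeps the per-call pipeline unchanged; only the final labels get rewritten.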
Most diarization papers treat it as a solved problem or skip it entirely ("assume N speakers"). But in real meetings nobody tells you upfront how many people are on the call. GMM+BIC gets you to 51% exact match on VoxConverse, which sounds bad until you look at it per bucket: for 1–4 speakers it's 54–91% exact and 88–97% within ±1. It's 8+ speakers where it completely falls apart (0% exact match).
Curious if anyone has found better approaches for automatic speaker count estimation that don't require a neural model.
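For reference, the GMM+BIC approach discussed above boils down to fitting mixtures with increasing component counts and keeping the one with the lowest BIC. This is a generic scikit-learn sketch, not the library's exact implementation (covariance type and search range are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_num_speakers(embeddings, max_speakers=10, seed=0):
    """Estimate speaker count by minimizing BIC over GMM sizes.

    embeddings: (n_segments, dim) array of per-segment speaker
    embeddings. BIC trades off log-likelihood against parameter
    count, so it penalizes adding components that don't explain
    much extra structure.
    """
    best_k, best_bic = 1, np.inf
    for k in range(1, min(max_speakers, len(embeddings)) + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=seed).fit(embeddings)
        bic = gmm.bic(embeddings)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```

The failure mode at 8+ speakers is consistent with this setup: with many speakers each cluster contributes few segments, so the BIC penalty discourages adding the extra components they would need.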
loookas•7h ago
I started with pyannote, which is the standard tool for this. It worked, but processing a single call took forever on CPU, and the fans on my MacBook sounded like a jet engine. So I decided to build something faster.
The pipeline: Silero VAD → WeSpeaker ResNet34 embeddings (ONNX Runtime) → GMM+BIC speaker count estimation → spectral clustering. All classical ML after the embedding step — no neural segmentation model like pyannote uses.
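The final stage of the pipeline above can be sketched with scikit-learn. This is a generic cosine-affinity spectral clustering step, not the library's exact code; any affinity preprocessing (thresholding, row normalization) the library does may differ.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_embeddings(embeddings, n_speakers):
    """Spectral clustering over a cosine-similarity affinity matrix.

    embeddings: (n_segments, dim) speaker embeddings.
    n_speakers: count from the estimation stage (e.g. GMM+BIC).
    Returns an integer cluster label per segment.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(X @ X.T, 0.0, 1.0)  # cosine sim, negatives zeroed
    sc = SpectralClustering(n_clusters=n_speakers, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(affinity)
```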
Results on VoxConverse (216 files, 1–20 speakers):
DER: ~10.8% (pyannote free models: ~11.2%)
CPU speed: RTF 0.12 vs 0.86 (pyannote community-1), about 7x faster
10-min recording: ~1.2 min vs ~8.6 min
Speaker count: 87–97% within ±1 for 1–5 speakers
What it doesn't do well: 8+ speakers (count estimation breaks down), overlapping speech (single speaker per frame), and it's only been benchmarked on one dataset so far.
Usage: pip install diarize
from diarize import diarize
result = diarize("meeting.wav")
No GPU, no API keys, no HuggingFace account. Apache 2.0. Happy to answer questions about the architecture, benchmarks, or tradeoffs.