Another lever we added was keeping the last few call centroids and biasing the spectral solver toward the prototype that had >0.75 similarity, which keeps returning participants from spawning a new SPEAKER label every session. Are you thinking about exposing that kind of anchor_embeddings hook so teams can keep participant IDs consistent across calls?
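The anchor-matching idea described above can be sketched as a small post-processing step. This is an illustrative implementation, not the commenter's actual code: `relabel_with_anchors` and its signature are assumptions, and only the 0.75 cosine-similarity threshold comes from the comment.

```python
import numpy as np

def relabel_with_anchors(centroids, anchors, threshold=0.75):
    """Map fresh cluster centroids onto stored speaker prototypes.

    centroids: {label: embedding vector} from the current call
    anchors:   {speaker_id: embedding vector} kept from earlier calls
    A label whose best cosine similarity to a prototype exceeds
    `threshold` inherits that speaker_id; otherwise it keeps its
    fresh per-call label (SPEAKER_00, SPEAKER_01, ...).
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    mapping = {}
    for label, c in centroids.items():
        best_id, best_sim = None, threshold
        for sid, proto in anchors.items():
            sim = cos(c, proto)
            if sim > best_sim:
                best_id, best_sim = sid, sim
        mapping[label] = best_id if best_id is not None else label
    return mapping
```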
On cross-session speaker consistency: yes, that's on the roadmap. The plan is to store speaker embeddings (256-dim vectors) in a vector DB and use them for matching during diarization.
Something like an anchor_embeddings parameter you can pass in, so the output labels stay consistent across calls.
Right now every call produces SPEAKER_00, SPEAKER_01, etc. independently. The embedding extraction already works well enough for matching (that's what cosine similarity on WeSpeaker embeddings is good at); the missing piece is the API surface and the matching logic on top of clustering.
What's your setup for storing/matching the centroids? Curious if you're doing it at inference time or as a post-processing step.
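One way to answer the storing/matching question: keep a running prototype per speaker and refresh it after each call. This is a toy sketch, not part of the `diarize` library; the `SpeakerStore` class and the exponential-moving-average update are my assumptions about how such a store could work.

```python
import numpy as np

class SpeakerStore:
    """Toy cross-call centroid store (illustrative only).

    Keeps one unit-norm prototype per speaker and blends in each new
    call's centroid with an exponential moving average, so gradual
    drift in a speaker's embeddings is absorbed over time.
    """
    def __init__(self, alpha=0.2):
        self.protos = {}   # speaker_id -> unit-norm prototype vector
        self.alpha = alpha  # weight given to the newest centroid

    def update(self, speaker_id, centroid):
        v = centroid / np.linalg.norm(centroid)
        if speaker_id in self.protos:
            mixed = (1 - self.alpha) * self.protos[speaker_id] + self.alpha * v
            self.protos[speaker_id] = mixed / np.linalg.norm(mixed)
        else:
            self.protos[speaker_id] = v
```

Doing this as a post-processing step (rather than inside clustering) keeps the per-call pipeline unchanged; only the final labels get rewritten.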
Most diarization papers treat it as a solved problem or skip it entirely ("assume N speakers"). But in real meetings nobody tells you upfront how many people are on the call. GMM+BIC gets you to 51% exact match on VoxConverse, which sounds bad until you look at it per bucket: for 1–4 speakers it's 54–91% exact and 88–97% within ±1. It's 8+ speakers where it completely falls apart (0% exact match).
Curious if anyone has found better approaches for automatic speaker count estimation that don't require a neural model.
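For reference, the GMM+BIC approach discussed above boils down to fitting mixtures with increasing component counts and keeping the one with the lowest BIC. This is a generic scikit-learn sketch, not the library's exact implementation (covariance type and search range are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_num_speakers(embeddings, max_speakers=10, seed=0):
    """Estimate speaker count by minimizing BIC over GMM sizes.

    embeddings: (n_segments, dim) array of per-segment speaker
    embeddings. BIC trades off log-likelihood against parameter
    count, so it penalizes adding components that don't explain
    much extra structure.
    """
    best_k, best_bic = 1, np.inf
    for k in range(1, min(max_speakers, len(embeddings)) + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=seed).fit(embeddings)
        bic = gmm.bic(embeddings)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```

The failure mode at 8+ speakers is consistent with this setup: with many speakers each cluster contributes few segments, so the BIC penalty discourages adding the extra components they would need.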
loookas•7h ago
I started with pyannote, which is the standard tool for this. It worked, but processing a single call took forever on CPU, and the fans on my MacBook sounded like a jet engine. So I decided to build something faster.
The pipeline: Silero VAD → WeSpeaker ResNet34 embeddings (ONNX Runtime) → GMM+BIC speaker count estimation → spectral clustering. All classical ML after the embedding step — no neural segmentation model like pyannote uses.
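The final stage of the pipeline above can be sketched with scikit-learn. This is a generic cosine-affinity spectral clustering step, not the library's exact code; any affinity preprocessing (thresholding, row normalization) the library does may differ.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_embeddings(embeddings, n_speakers):
    """Spectral clustering over a cosine-similarity affinity matrix.

    embeddings: (n_segments, dim) speaker embeddings.
    n_speakers: count from the estimation stage (e.g. GMM+BIC).
    Returns an integer cluster label per segment.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(X @ X.T, 0.0, 1.0)  # cosine sim, negatives zeroed
    sc = SpectralClustering(n_clusters=n_speakers, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(affinity)
```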
Results on VoxConverse (216 files, 1–20 speakers):
DER: ~10.8% (pyannote free models: ~11.2%)
CPU speed: RTF 0.12 vs 0.86 (pyannote community-1), about 7x faster
10-min recording: ~1.2 min vs ~8.6 min
Speaker count: 87–97% within ±1 for 1–5 speakers
What it doesn't do well: 8+ speakers (count estimation breaks down), overlapping speech (single speaker per frame), and it's only been benchmarked on one dataset so far.
Usage: pip install diarize
from diarize import diarize
result = diarize("meeting.wav")
No GPU, no API keys, no HuggingFace account. Apache 2.0. Happy to answer questions about the architecture, benchmarks, or tradeoffs.