1 hour of audio processed in 5 seconds (RTX 4090, Ryzen 9 7950X). ~17x faster than Pyannote 3.1.
On M3 MacBook Air, 1 hour in 23.5 seconds (~14x faster).
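For a sense of scale, those numbers can be turned into real-time factors. This is only a quick sketch of the arithmetic: the helper name is mine, and the "implied baseline" times are back-calculated from the quoted speedups, not measured separately.

```python
AUDIO_SECONDS = 3600  # 1 hour of audio


def realtime_factor(audio_s: float, processing_s: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_s / processing_s


# (processing time in seconds, quoted speedup over Pyannote 3.1)
benchmarks = {
    "RTX 4090 + Ryzen 9 7950X": (5.0, 17),
    "M3 MacBook Air": (23.5, 14),
}

for name, (proc_s, speedup) in benchmarks.items():
    rtf = realtime_factor(AUDIO_SECONDS, proc_s)
    implied_baseline_s = proc_s * speedup  # derived from the claim, not measured
    print(f"{name}: ~{rtf:.0f}x real time, "
          f"implied baseline ~{implied_baseline_s:.0f} s per hour of audio")
```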
This is a custom speaker diarization pipeline I've developed; it's a modified version of the pipeline found in the excellent 3D-Speaker project by Alibaba Research.
My optimizations/modifications were the following:
- changed VAD model
- multi-threaded Fbank feature extraction
- batched inference of CAM++ embeddings model
- clustering is accelerated by RAPIDS when an NVIDIA GPU is available
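To make the shape of these changes concrete, here is a rough, stdlib-only sketch of the three ideas: threaded feature extraction, batched embedding inference, and an optional GPU clustering backend. It is not Senko's actual code. Every function name here is hypothetical, and the feature and embedding functions are toy placeholders standing in for Fbank extraction and the CAM++ model.

```python
from concurrent.futures import ThreadPoolExecutor
from importlib.util import find_spec


def extract_features(segment):
    """Toy stand-in for Fbank extraction: one averaged value per 4 samples."""
    return [sum(segment[i:i + 4]) / 4 for i in range(0, len(segment), 4)]


def extract_all(segments, workers=4):
    """Extract features for many segments on a thread pool, preserving order.

    Threads only help if the real extraction runs in native code that
    releases the GIL; that is assumed here, not demonstrated.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_features, segments))


def batches(items, batch_size):
    """Yield consecutive batches; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_batch(feature_batch):
    """Toy stand-in for one model forward pass over a whole batch."""
    return [[sum(f) / max(len(f), 1)] for f in feature_batch]


def pick_clustering_backend():
    """Prefer RAPIDS cuML if the package is importable, else use the CPU.

    This only checks that the package is installed, not that a GPU is usable.
    """
    return "cuml" if find_spec("cuml") is not None else "cpu"


# Ten fake speech segments of 16 samples each.
segments = [[float(i + j) for j in range(16)] for i in range(10)]
features = extract_all(segments)
embeddings = [emb for batch in batches(features, 4) for emb in embed_batch(batch)]
print(len(embeddings), "embeddings; clustering backend:", pick_clustering_backend())
```

The point of batching is that the model runs once per group of segments instead of once per segment, which is usually where a GPU starts paying off.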
Optimizations aside, massive credit needs to be given to the CAM++ speaker embeddings model, whose efficiency is where the majority of the speed comes from.
Let me know what you think! Were you also frustrated by how slow speaker diarization is? Does Senko's speed unlock new use cases for you? Cheers, everyone.
hamza_q_•5h ago
This pipeline powers the Zanshin media player, which is an attempt at a usable integration of diarization in a media player. Check it out here: https://zanshin.sh And discuss here: https://news.ycombinator.com/item?id=45104866