Ask HN: What Speaker Diarization tools should I look into?

11•justforfunhere•4d ago
Hi,

I am building a tool that needs to analyze a conversation (non-English) between two people. The conversation comes to me as audio. I currently use OpenAI Whisper to transcribe it and then feed the transcription to the ChatGPT-4o model through the API for analysis.

So far, it's doing a fair job. Sometimes, though, when reading the transcription, I find it hard to figure out which speaker is saying what, and I have to listen to the audio to work it out. I wonder whether ChatGPT-4o also sometimes struggles to follow the conversation from the transcription. I think adding a speaker diarization step might make the transcription easier to understand and analyze.

I am looking for speaker diarization tools that I can use. I have tried pyannote speaker-diarization-3.1, but it does not work very well for me. What other options should I look at?
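Adding a diarization step to a Whisper pipeline mostly comes down to time alignment: each transcript segment gets the speaker whose diarization turns overlap it most. A minimal sketch, assuming the diarizer's output is in RTTM form (the plain-text format pyannote can write) and that Whisper segments are dicts with `start`, `end` (seconds) and `text` keys. The helper names `parse_rttm` and `assign_speakers` and the sample data are hypothetical.

```python
from collections import defaultdict


def parse_rttm(text):
    """Parse RTTM speaker lines into (start, end, speaker) tuples, in seconds."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
        if not fields or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        turns.append((onset, onset + duration, fields[7]))
    return turns


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each transcript segment with the speaker it overlaps most."""
    labeled = []
    for seg in segments:
        overlap = defaultdict(float)
        for start, end, speaker in turns:
            # Seconds shared by the segment and this turn (<= 0 if disjoint).
            shared = min(seg["end"], end) - max(seg["start"], start)
            if shared > 0:
                overlap[speaker] += shared
        speaker = max(overlap, key=overlap.get) if overlap else unknown
        labeled.append({**seg, "speaker": speaker})
    return labeled


rttm = """SPEAKER call 1 0.00 4.20 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER call 1 4.10 3.00 <NA> <NA> SPEAKER_01 <NA> <NA>"""
segments = [
    {"start": 0.0, "end": 4.0, "text": "Hello, how are you?"},
    {"start": 4.3, "end": 7.0, "text": "Fine, thanks."},
]
for seg in assign_speakers(segments, parse_rttm(rttm)):
    print(f'{seg["speaker"]}: {seg["text"]}')
# SPEAKER_00: Hello, how are you?
# SPEAKER_01: Fine, thanks.
```

Majority overlap is a simple rule; segments that span a speaker change still get a single label, which is one reason word-level speaker labels can read better.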

Comments

nemima•4d ago
Hi, I'm an engineer at Speechmatics. Our speech-to-text software handles speaker diarization very reliably, and we're a go-to choice for non-English languages. https://www.speechmatics.com/

How long is the audio file? If it's under 2 hours, you can upload the file and transcribe it with diarization for free using our web portal: https://portal.speechmatics.com/jobs/create/batch

Hope it helps for your use case! If it does, and you encounter any issues, drop us an email at devrel@speechmatics.com :)

EDIT: typo

justforfunhere•4d ago
Hi, yes, it is well under two hours. The longest audio that I have had to handle as of now is around 10 minutes.

I will give your portal a try soon. Thanks!

hildekominskia•4d ago
Skip pyannote 3.1; two battle-tested upgrades:

1. NVIDIA NeMo’s `diar_msdd_telephonic` (8 kHz) or `diar_msdd_mic` (16 kHz): one-line Python install, GPU optional, beats pyannote on cross-talk.
2. AssemblyAI’s async `/v2/transcript` endpoint: gives you `words[].speaker` plus Whisper-level accuracy for 40+ languages. Free tier: 3 h/month.

Glue either to your existing Whisper pipeline and feed ChatGPT-4o with speaker-tagged text. The jump in clarity is night-and-day.
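With word-level speaker labels (such as the `words[].speaker` field mentioned above), speaker-tagged text for the LLM can be built by merging consecutive words from the same speaker into one turn. A minimal sketch, assuming each word is a dict with `text` and `speaker` keys; the function names `words_to_turns` and `format_for_prompt` and the word list are made up for illustration.

```python
def words_to_turns(words):
    """Merge consecutive words that share a speaker label into turns."""
    turns = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
        else:
            # First word, or the speaker changed: start a new turn.
            turns.append({"speaker": word["speaker"], "text": word["text"]})
    return turns


def format_for_prompt(turns):
    """Render turns as 'Speaker X: ...' lines to paste into an LLM prompt."""
    return "\n".join(f"Speaker {t['speaker']}: {t['text']}" for t in turns)


words = [
    {"text": "Did", "speaker": "A"},
    {"text": "you", "speaker": "A"},
    {"text": "call?", "speaker": "A"},
    {"text": "Yes,", "speaker": "B"},
    {"text": "twice.", "speaker": "B"},
]
print(format_for_prompt(words_to_turns(words)))
# Speaker A: Did you call?
# Speaker B: Yes, twice.
```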

I use the same combo to auto-caption interviews, then drop the synced footage into Veo 3 (https://veo-3.app) for instant talking-head explainers—works even for non-English audio.

hbredin•3d ago
Hey, I am the creator of pyannote open-source toolkit.

I just created a company around it that serves much better diarization models through an API.

You can test it by creating an account on https://dashboard.pyannote.ai. You'll get 150h of diarization for free.

There is also a playground where you can simply upload a file and visualize the diarization results.

satvikpendem•3d ago
Seems like this only diarizes; is there a transcription interface as well? The prices seem a bit high for diarization alone, since something like Soniox is also ~13 cents for real-time diarization with transcription included.

satvikpendem•3d ago
Google Gemini and ElevenLabs are quite good at transcription with diarization if you already have the audio file. For real time, I like Soniox; you can use their comparison page, which runs all the major transcription services at once [0]. Note that the Google model on that page is not Gemini; it's Google's older Chirp model.

[0] https://soniox.com/compare/

vismit2000•3d ago
ElevenLabs does speaker diarization really well in my experience: https://elevenlabs.io/ (I first came across it through the Lex Fridman podcast episode with Modi.)

meerab•2d ago
I am building VideoToBe.com, and I have found whisperX to be the most reliable.

https://github.com/m-bain/whisperX

It is built on top of OpenAI Whisper, so speech recognition is good, and the transcript tags speakers as 'SPEAKER_00', 'SPEAKER_01', etc.

Here is what a transcript looks like:

https://videotobe.com/play/media/1b02f75a-9503-43aa-8956-d18...
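Once segments carry labels like 'SPEAKER_00', simple summaries fall out of the timestamps. For example, total talk time per speaker is a quick sanity check on the diarization before the transcript goes to an LLM. A minimal sketch, assuming segments are dicts with `start`, `end` (seconds) and `speaker` keys, a shape assumed here to resemble whisperX output after speaker assignment; `talk_time` and the sample data are hypothetical.

```python
from collections import Counter


def talk_time(segments):
    """Total seconds spoken per speaker label, largest first."""
    totals = Counter()
    for seg in segments:
        # Segments with no speaker label are skipped rather than guessed.
        if "speaker" in seg:
            totals[seg["speaker"]] += seg["end"] - seg["start"]
    return totals.most_common()


segments = [
    {"start": 0.0, "end": 6.5, "speaker": "SPEAKER_00", "text": "Hi, thanks for joining."},
    {"start": 6.5, "end": 9.0, "speaker": "SPEAKER_01", "text": "Happy to be here."},
    {"start": 9.0, "end": 12.0, "speaker": "SPEAKER_00", "text": "Let's get started."},
]
for speaker, seconds in talk_time(segments):
    print(f"{speaker}: {seconds:.1f} s")
# SPEAKER_00: 9.5 s
# SPEAKER_01: 2.5 s
```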