Perhaps half with a web browser to view the results, and half working blind with the numbers alone?
Without having an LLM figure out the required command line parameters? Mad props!
These things are getting really good at just regular transcription (as long as you don't care about a strictly verbatim record), but every additional dimension you add (timestamps, speaker assignment, etc.) makes the others worse. These work much better as independent processes that then get reconciled and refined by a multimodal LLM.
simonw•1h ago
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...

I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match of what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what if there was a key point that Gemini 3 omitted entirely?
WesleyLivesay•1h ago
Almost makes me wonder if, behind the scenes, it is doing something like: rough transcript -> summaries -> transcript with timecodes (runs out of context) -> slaps whatever timestamps it has onto the summaries.
I would be very curious to see if it does better on something like an hour-long chunk of audio, to see if this is just some sort of context issue - or if this same audio were fed to it in, say, 45-minute chunks, whether the timestamps would fix themselves.
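A quick sketch of that chunking experiment, assuming ffmpeg is available and the recording lives in a hypothetical meeting.mp3:

    import subprocess

    # Split the recording into 45-minute (2700 s) chunks without
    # re-encoding, so each chunk can be fed to the model separately.
    subprocess.run(
        [
            "ffmpeg", "-i", "meeting.mp3",
            "-f", "segment", "-segment_time", "2700",
            "-c", "copy",
            "chunk_%03d.mp3",
        ],
        check=True,
    )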
simonw•1h ago
The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.
The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
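A rough sketch of how that reconciliation could work - the data shapes for both outputs are made up for illustration, not a real API:

    from dataclasses import dataclass

    @dataclass
    class Word:
        """One timestamped word from Whisper/Parakeet."""
        text: str
        start: float  # seconds
        end: float

    @dataclass
    class Turn:
        """One speaker turn from Gemini (approximate times)."""
        speaker: str
        start: float
        end: float

    def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
        """Label each accurately timestamped word with the speaker whose
        turn overlaps it most (the nearest turn when nothing overlaps)."""
        labelled = []
        for w in words:
            best = max(turns, key=lambda t: min(w.end, t.end) - max(w.start, t.start))
            labelled.append((best.speaker, w.text))
        return labelled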
rahimnathwani•1h ago
If you need diarization, you can use something like https://github.com/m-bain/whisperX
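The basic flow, per the whisperX README (model name, device and the Hugging Face token are placeholders, and the diarization pipeline has moved between modules across versions, so check the current docs):

    import whisperx

    device = "cuda"
    audio = whisperx.load_audio("meeting.mp3")

    # 1. Transcribe with batched Whisper
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)

    # 2. Align for accurate word-level timestamps
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # 3. Diarize, then attach speaker labels to words/segments
    diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)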
simonw•1h ago
"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.
Workaccount2•1h ago
I wonder, if you put the audio into a video that is nothing but a black screen with a timer running, whether it would be able to timestamp correctly.
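Something like this should produce that test video - a sketch assuming an ffmpeg build with the drawtext filter, which burns the presentation timestamp in as a running clock:

    import subprocess

    # Black 720p video with a running clock rendered by drawtext
    # (%{pts\:hms}), muxed with the audio; -shortest stops at audio end.
    subprocess.run(
        [
            "ffmpeg",
            "-f", "lavfi", "-i", "color=c=black:s=1280x720:r=25",
            "-i", "meeting.mp3",
            "-vf", r"drawtext=text='%{pts\:hms}':fontcolor=white"
                   ":fontsize=72:x=(w-text_w)/2:y=(h-text_h)/2",
            "-shortest",
            "timer.mp4",
        ],
        check=True,
    )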
potatolicious•44m ago
IMO the right way to do this is to feed the audio into a transcription model, specifically one that supports diarization (separation of multiple speakers). This will give you a high quality raw transcript that is pretty much exactly what was actually said.
It would be rough in places (e.g., "Speaker 1", "Speaker 2" rather than actual speaker names).
Then you want to post-process with a LLM to re-annotate the transcript and clean it up (e.g., replace "Speaker 1" with "Mayor Bob"), and query against it.
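A minimal sketch of that second pass, using the OpenAI client purely as a stand-in (any LLM API works; the prompt and model name are illustrative):

    from openai import OpenAI

    client = OpenAI()

    def relabel(transcript: str, roster: str) -> str:
        """Ask an LLM to map generic 'Speaker 1'-style labels onto real
        names, given context such as an agenda or attendee list."""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "Rewrite the diarized transcript, replacing generic "
                    "speaker labels with real names where the context "
                    "makes them clear. Leave a label unchanged if unsure.")},
                {"role": "user", "content": (
                    f"Attendees:\n{roster}\n\nTranscript:\n{transcript}")},
            ],
        )
        return response.choices[0].message.content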
I see another post here complaining that direct-to-LLM beats a transcription model like Whisper - I would challenge that. Any modern ASR model will do a very, very good job with 95%+ accuracy.
simonw•23m ago
(Update: I just updated MacWhisper and it can now run Parakeet which appears to have decent diarization built in, screenshot here: https://static.simonwillison.net/static/2025/macwhisper-para... )
sillyfluke•13m ago
LLM summarization is utterly useless when you want 100% accuracy on the final binding decisions of something like a council meeting. My experience has been that LLMs cannot be trusted to follow convoluted discussions, including agenda items that get revisited later in the meeting, etc.
With transcriptions, the catastrophic risk is far lower, since I'm doing the summarizing from the transcript myself. But in that case, for an auto-generated transcript, I'll take correct timestamps with gibberish-sounding sentences over incorrect timestamps with "convincing" but hallucinated sentences any day.
Any LLM summarization of a sufficiently important meeting requires second-by-second human verification against the audio recording. I have yet to see this convincingly refuted (i.e., an LLM that consistently maintains 100% accuracy when summarizing meeting decisions).