simonw•41m ago
The audio transcript exercise here is particularly interesting from a journalism perspective.
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
    Output a Markdown transcript of this meeting. Include speaker
    names and timestamps. Start with an outline of the key
    meeting sections, each with a title and summary and timestamp
    and list of participating names. Note in bold if anyone
    raised their voices, interrupted each other or had
    disagreements. Then follow with the full transcript.
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...
I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match to what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what about if there was a key point that Gemini 3 omitted entirely?
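(For anyone who wants to try reproducing this: a minimal sketch using the google-genai Python SDK. The model ID and filename are assumptions; swap in whichever Gemini 3 identifier your account exposes.)

    # Minimal reproduction sketch using the google-genai SDK
    # (pip install google-genai; assumes GEMINI_API_KEY is set).
    # The model ID below is an assumption, not a confirmed name.
    from google import genai

    client = genai.Client()

    # Large audio files go through the Files API rather than inline.
    audio = client.files.upload(file="council_meeting.mp3")

    prompt = (
        "Output a Markdown transcript of this meeting. Include speaker "
        "names and timestamps. Start with an outline of the key meeting "
        "sections, each with a title and summary and timestamp and list "
        "of participating names. Note in bold if anyone raised their "
        "voices, interrupted each other or had disagreements. Then "
        "follow with the full transcript."
    )

    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed model ID
        contents=[prompt, audio],
    )
    print(response.text)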
WesleyLivesay•27m ago
I think it appears to have done a good job of summarizing the points that it did summarize, at least judging from my quick watch of a few sections and from the YT transcript (which seems quite accurate).
Almost makes me wonder if, behind the scenes, it is doing something like: rough transcript -> summaries -> transcript with timecodes (runs out of context) -> throws the timestamps it does have onto the summaries.
I would be very curious to see if it does better on something like an hour long chunk of audio, to see if it is just some sort of context issue. Or if this same audio was fed to it in say 45 minute chunks to see if the timestamps fix themselves.
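A rough sketch of that chunking experiment with pydub; the 45-minute chunk size is arbitrary and the per-chunk model call is left out:

    # Sketch of the 45-minute-chunk experiment with pydub
    # (pip install pydub; needs ffmpeg on the PATH).
    from pydub import AudioSegment

    CHUNK_MS = 45 * 60 * 1000  # 45 minutes, an arbitrary choice

    audio = AudioSegment.from_file("council_meeting.mp3")
    for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
        chunk = audio[start:start + CHUNK_MS]
        chunk.export(f"chunk_{i:02d}.mp3", format="mp3")
        # Transcribe each chunk separately (model call not shown), then
        # add `start` ms back onto every returned timestamp so the
        # chunks line up against the full 3.5 hour recording.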
byt3bl33d3r•25m ago
I’ve been meaning to create & publish a structured extraction benchmark for a while. Using LLMs to extract info/entities/connections from large amounts of unstructured data is a huge boon to AI-assisted reporting and also has a number of cybersecurity applications. Gemini 2.5 was pretty good, but so far I have yet to see an LLM that can reliably, accurately, and consistently do this.
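A sketch of the kind of schema-constrained extraction I mean, using the google-genai SDK's structured output support; the schema fields are purely illustrative and the model ID is an assumption:

    # Sketch of schema-constrained entity extraction via google-genai.
    # Field names are illustrative only; the model ID is an assumption.
    from pydantic import BaseModel
    from google import genai


    class Entity(BaseModel):
        name: str
        kind: str  # e.g. "person", "org", "domain", "ip"
        connected_to: list[str]


    class Extraction(BaseModel):
        entities: list[Entity]


    client = genai.Client()
    document_text = open("report.txt").read()

    response = client.models.generate_content(
        model="gemini-2.5-pro",  # assumed model ID
        contents=["Extract every entity and its connections:", document_text],
        config={
            "response_mime_type": "application/json",
            "response_schema": Extraction,
        },
    )
    result = response.parsed  # a validated Extraction instance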
simonw•6m ago
This would be extremely useful. I think this is one of the most commercially valuable uses of these kinds of models; having more solid independent benchmarks would be great.
mistercheph•22m ago
For this use case I think the best bet is still a toolchain with a transcription model like Whisper fed into an LLM to summarize.
simonw•20m ago
Yeah I agree. I ran Whisper (via MacWhisper) on the same video and got back accurate timestamps.
The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.
The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
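A rough sketch of that hybrid, using the open-source whisper package for the timestamped half; the step that merges in Gemini's speaker labels is hand-waved:

    # Sketch of the hybrid: openai-whisper for reliable timestamps,
    # with Gemini's speaker/tone labels merged in afterwards.
    # pip install openai-whisper
    import whisper

    model = whisper.load_model("medium")
    result = model.transcribe("council_meeting.mp3")

    for seg in result["segments"]:
        # seg["start"]/seg["end"] are second offsets that track the audio.
        print(f'[{seg["start"]:8.1f}s] {seg["text"].strip()}')
        # Not shown: fuzzy-match each segment against Gemini's
        # speaker-labelled transcript to attach names and tone notes.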
rahimnathwani•21m ago
For this use case, why not use Whisper to transcribe the audio, and then an LLM to do a second step (summarization or answering questions or whatever)?
If you need diarization, you can use something like https://github.com/m-bain/whisperX
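Roughly the flow from the whisperX README (the API has moved around between versions, so treat this as a sketch; diarization needs a Hugging Face token for the pyannote models):

    # Condensed from the whisperX README; API details may have shifted
    # between versions. Diarization requires a Hugging Face token.
    import whisperx

    device = "cuda"
    audio = whisperx.load_audio("council_meeting.mp3")

    model = whisperx.load_model("large-v2", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)

    # Assign speaker labels to the transcribed segments.
    diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    for seg in result["segments"]:
        print(seg.get("speaker", "UNKNOWN"), seg["text"])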
pants2•8m ago
Whisper simply isn't very good compared to LLM audio transcription like gpt-4o-transcribe. If Gemini 3 is even better, it's a game-changer.
ks2048•16m ago
Does anyone benchmark these models for speech-to-text using traditional word error rates? It seems audio-input Gemini is a lot cheaper than Google Speech-to-text.
"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.
Workaccount2•15m ago
My assumption is that Gemini has no insight into the timestamps, and is instead ballparking them based on how much context has been analyzed up to that point.
I wonder whether it would be able to timestamp correctly if you put the audio into a video that is nothing but a black screen with a timer running.
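That's easy enough to test with ffmpeg: burn a running clock onto a black frame and mux in the meeting audio. A sketch (untested; drawtext escaping is finicky and needs an ffmpeg build with libfreetype):

    # Untested sketch: black video with a burned-in HH:MM:SS clock,
    # muxed with the meeting audio, so the model can "see" wall time.
    import subprocess

    subprocess.run([
        "ffmpeg",
        "-f", "lavfi", "-i", "color=c=black:s=640x360",   # blank video source
        "-i", "council_meeting.mp3",                      # the meeting audio
        "-vf", r"drawtext=text='%{pts\:hms}':fontcolor=white"
               r":fontsize=48:x=(w-text_w)/2:y=(h-text_h)/2",
        "-shortest",                                      # stop with the audio
        "timer_meeting.mp4",
    ], check=True)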
londons_explore•19m ago
Anyone got a class full of students and able to get a human version of this pelican benchmark?
Perhaps half with a web browser to view the results, and half working blind with the numbers alone?