Perhaps half with a web browser to view the results, and half working blind with the numbers alone?
0: https://www.wired.com/2016/04/can-draw-bikes-memory-definite...
I imagine that you could theoretically also guess and check without the web browser by manually rendering the SVG using some graph paper, a compass, a straightedge, and coloured pencils, but that sounds unbelievably tedious and also very error-prone.
Without having an LLM figure out the required command line parameters? Mad props!
These things are getting really good at plain transcription (as long as you don't care about it being verbatim), but every additional dimension you add (timestamps, speaker assignment, etc.) makes the others worse. These tasks work much better as independent processes that are then reconciled and refined by a multimodal LLM.
> Truth be told, I’m playing the long game here. All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
I think a more challenging, well, challenge would be to offer an even more absurd scenario and see how the model handles it.
Example: generate an svg of a pelican and a mongoose eating popcorn inside a pyramid-shaped vehicle flying around Jupiter. Result: https://imgur.com/a/TBGYChc
I was inspired by Max Woolf's nano banana test prompts: https://minimaxir.com/2025/11/nano-banana-prompts/
Do you think it would be reasonable to include both in future reviews, at least for the sake of back-compatibility (and comparability)?
Love the pivot in pelican generation bench.
I'll need to update for V2!
I noticed that Gemini 3 Pro can no longer recognize the audio files I upload. It just gives me information from old chats or random stuff that isn’t relevant. It can’t even grasp the topic of the class anymore; it just spits out nonsense.
Something has changed. Just a few days ago it worked just fine with Gemini 2.5 Pro!
It's not just me: https://www.reddit.com/r/GeminiAI/comments/1p0givt/gemini_25...
But I'm seeing this with Gemini 3 Pro on the web (Pro subscription); AI Studio is still working fine.
simonw•2mo ago
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...
I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match to what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what about if there was a key point that Gemini 3 omitted entirely?
WesleyLivesay•2mo ago
Almost makes me wonder if, behind the scenes, it's doing something similar to: rough transcript -> summaries -> transcript with timecodes (runs out of context) -> slaps the timestamps it does have onto the summaries.
I would be very curious to see if it does better on something like an hour long chunk of audio, to see if it is just some sort of context issue. Or if this same audio was fed to it in say 45 minute chunks to see if the timestamps fix themselves.
simonw•2mo ago
The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.
The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
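A minimal sketch of what that reconciliation could look like, assuming you already have a timestamped ASR transcript and a separate speaker-labelled pass; the data shapes and names here are invented for illustration:

```python
# Sketch: give each ASR segment (accurate timestamps, from Whisper/Parakeet)
# the speaker label from whichever diarized span (e.g. from Gemini) overlaps it most.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str

@dataclass
class SpeakerSpan:
    start: float
    end: float
    speaker: str  # e.g. "Mayor Bob"

def overlap(a: Segment, b: SpeakerSpan) -> float:
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def merge(asr_segments, speaker_spans):
    merged = []
    for seg in asr_segments:
        best = max(speaker_spans, key=lambda s: overlap(seg, s), default=None)
        speaker = best.speaker if best and overlap(seg, best) > 0 else "Unknown"
        merged.append((seg.start, seg.end, speaker, seg.text))
    return merged

if __name__ == "__main__":
    asr = [Segment(0.0, 4.2, "Next item is the zoning variance."),
           Segment(4.2, 9.8, "I move that we approve it.")]
    spans = [SpeakerSpan(0.0, 5.0, "Chair"), SpeakerSpan(5.0, 10.0, "Councillor A")]
    for row in merge(asr, spans):
        print(row)
```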
rahimnathwani•2mo ago
If you need diarization, you can use something like https://github.com/m-bain/whisperX
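For reference, the usage pattern is roughly the one from the whisperX README; exact signatures have drifted between releases (newer versions expose DiarizationPipeline under whisperx.diarize), and the diarization step needs a Hugging Face token for the pyannote models:

```python
# Roughly the whisperX README flow: transcribe, word-align, then diarize and
# attach speaker labels. Treat names and arguments as approximate.
import whisperx

device = "cuda"
audio_file = "council_meeting.mp3"

model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# Word-level alignment for better timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Diarization (requires a Hugging Face token for the pyannote models)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), round(seg["start"], 1), seg["text"])
```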
simonw•2mo ago
"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.
Workaccount2•2mo ago
I wonder if you put the audio into a video that is nothing but a black screen with a timer running, it would be able to correctly timestamp.
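An untested sketch of how you might build such a test clip by wrapping ffmpeg from Python; the drawtext filter string and encoder flags are the parts to double-check (some ffmpeg builds also need an explicit fontfile= in drawtext):

```python
# Untested sketch: render a black video with a burned-in running clock and mux
# in the meeting audio, so the model has a visible timer to anchor timestamps.
import subprocess

audio = "council_meeting.mp3"
out = "meeting_with_timer.mp4"

subprocess.run([
    "ffmpeg", "-y",
    "-f", "lavfi", "-i", "color=c=black:s=1280x720:r=5",   # silent black canvas
    "-i", audio,
    "-vf", r"drawtext=text='%{pts\:hms}':fontsize=72:fontcolor=white:x=(w-text_w)/2:y=(h-text_h)/2",
    "-shortest",                                            # stop when the audio ends
    "-c:v", "libx264", "-c:a", "aac",
    out,
], check=True)
```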
potatolicious•2mo ago
IMO the right way to do this is to feed the audio into a transcription model, specifically one that supports diarization (separation of multiple speakers). This will give you a high quality raw transcript that is pretty much exactly what was actually said.
It would be rough in places (e.g., "Speaker 1", "Speaker 2", etc., rather than actual speaker names).
Then you want to post-process with an LLM to re-annotate the transcript and clean it up (e.g., replace "Speaker 1" with "Mayor Bob"), and to query against it.
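A rough sketch of that post-processing step, using a generic chat-completion call; the model name and prompt are placeholders for illustration, not what the commenter uses:

```python
# Sketch of the LLM cleanup pass: infer real names for "Speaker N" labels from
# context in the transcript and rewrite it. Model name is a placeholder.
from openai import OpenAI

client = OpenAI()

raw_transcript = open("diarized_transcript.txt").read()

prompt = (
    "Below is a meeting transcript with anonymous speaker labels.\n"
    "1. Infer the real name or role of each speaker from context "
    "(introductions, roll call, people addressing each other).\n"
    "2. Rewrite the transcript replacing 'Speaker N' with that name; keep "
    "'Speaker N' where you cannot tell.\n"
    "3. Start with a 'Speaker N -> name' mapping and your confidence in each.\n\n"
    + raw_transcript
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```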
I see another post here complaining that direct-to-LLM beats a transcription model like Whisper - I would challenge that. Any modern ASR model will do a very, very good job with 95%+ accuracy.
simonw•2mo ago
(Update: I just updated MacWhisper and it can now run Parakeet which appears to have decent diarization built in, screenshot here: https://static.simonwillison.net/static/2025/macwhisper-para... )
refulgentis•2mo ago
Separately, in my role as wizened 16 year old veteran of HN: it was jarring to read that. There’s a “rules” section, but don’t be turned off by the name; it’s more like a nice collection of guidelines for how to interact in a way that encourages productive discussion that illuminates. One of the key rules is not to interpret things weakly. Here, someone spelled out exactly how to do it, and we shouldn’t then assume it’s not AI, then tie it to a vague, demeaning description of “AI hype”, then ask an unanswerable question about what’s the point of “AI hype”.
To be clear, if you’re nontechnical and new to HN, it would be hard to know how to ask that a different way, I suppose.
darkwater•2mo ago
I think you misunderstood my comment. https://news.ycombinator.com/item?id=45973656 has got the right reading of it.
sillyfluke•2mo ago
LLM summarization is utterly useless when you want 100% accuracy on final binding decisions, as with council meeting decisions. My experience has been that LLMs cannot be trusted to follow convoluted discussions, including ones that revisit earlier agenda items later in the meeting.
With transcriptions, the catastrophic risk is far lower, since I'm doing the summarizing from the transcript myself. But in that case, for an auto-generated transcript, I'll take correct timestamps with gibberish-sounding sentences over incorrect timestamps with "convincing" but hallucinated sentences any day.
Any LLM summarization of a sufficiently important meeting requires second-by-second human verification against the audio recording. I have yet to see this convincingly refuted (i.e., an LLM that consistently maintains 100% accuracy when summarizing meeting decisions).
Royce-CMR•2mo ago
But not impossible. I’ve had success with prompts that identify all the topics and then map all the conversation tied to each topic (in separate LLM queries for each), then pull together a summary and conclusions by topic.
I’ve also had success with one-shot prompts, especially with the right context about the event and the phrasing shared up front. But honestly I end up spending about 5-10 minutes reviewing and cleaning up the output before it’s solid.
But that’s worlds better than attending the event and then manually pulling together notes from your fast in-flight shorthand.
(Former BA, ran JADs etc, lived and died by accuracy and right color / expression / context in notes)
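A hypothetical sketch of the topics-first workflow described above (enumerate topics, map the discussion per topic in separate LLM calls, then roll up); the model name and prompts are invented for illustration:

```python
# Sketch of the topics-first approach: enumerate agenda topics, gather the
# discussion and decision for each in its own LLM call, then roll everything
# up into one summary.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

transcript = open("meeting_transcript.txt").read()

topics = [t.strip() for t in ask(
    "List every distinct agenda topic discussed in this transcript, one per "
    "line, including topics that get revisited later in the meeting:\n\n"
    + transcript
).splitlines() if t.strip()]

per_topic = {
    topic: ask(
        f"Collect everything said about the topic '{topic}' in the transcript "
        "below, in order, and state any final decision or vote on it:\n\n"
        + transcript
    )
    for topic in topics
}

summary = ask(
    "Combine these per-topic notes into a concise meeting summary with a "
    "'Decisions' section:\n\n"
    + "\n\n".join(f"## {t}\n{notes}" for t, notes in per_topic.items())
)
print(summary)
```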
luke-stanley•2mo ago
Since plain decoder-only models stole the show, Google DeepMind demonstrated a way to adapt LLMs: adding a T5 encoder to an existing, normal Gemma model to get the benefits of more grounded text-to-text tasks WITHOUT instruction tuning (and its increased risk of prompt injection). They shared a few different variants on HuggingFace. I haven't got around to fine-tuning the weights of one for summarisation yet, but it could well be a good way to get more reliable summarisation. I did try out some of the models for inference, though, and made a Gist, which is useful since I found the HF default code example a bit broken:
https://gist.github.com/lukestanley/ee89758ea315b68fd66ba52c...
Google's minisite: https://deepmind.google/models/gemma/t5gemma/
Paper: https://arxiv.org/abs/2504.06225
Here is one such model on HF that didn't hallucinate and actually did summarise: https://huggingface.co/google/t5gemma-l-l-prefixlm
anilgulecha•2mo ago
The cutting step is simple, and the token count is pretty much the same, but the crucial additional detail allows for excellent transcription fidelity, time-wise.
We've also experimented with passing in a regular speech-to-text (non-LLM) transcript for reference, which again helps the LLM do better.
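If the "cutting step" means chunking the audio and telling the model each chunk's absolute offset, a sketch with pydub might look like this; the chunk length and prompt wording are guesses, not what the commenter actually uses:

```python
# Sketch: cut the audio into fixed-length chunks and tell the model each
# chunk's absolute offset, so per-chunk timestamps map back onto the full
# recording.
from pydub import AudioSegment

CHUNK_MINUTES = 45

audio = AudioSegment.from_file("council_meeting.mp3")
chunk_ms = CHUNK_MINUTES * 60 * 1000

for i, start_ms in enumerate(range(0, len(audio), chunk_ms)):
    chunk = audio[start_ms:start_ms + chunk_ms]
    path = f"chunk_{i:02d}.mp3"
    chunk.export(path, format="mp3")

    offset_min = start_ms // 60_000
    prompt = (
        f"This audio begins {offset_min} minutes into a longer meeting. "
        "Transcribe it with speaker labels, reporting every timestamp as an "
        "absolute offset from the start of the full meeting."
    )
    # ...upload `path` with `prompt` to the multimodal LLM of your choice here.
```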
theshrike79•2mo ago
It goes through the video analysis process and terminology step by step. You need to say "timecode" instead of "timestamp" so the LLM aligns more with AV terminology than with programming (weird, right?).
Basically you need two to three passes to get a proper transcription (a rough code sketch follows the list):

1. Just listen and identify voices, giving each some kind of ID.
2. Maybe watch the video and see if the people have name tags on the table in front of them, or whether there's an on-screen overlay (news or interviews).
3. On the last pass, go through the transcription and map voice IDs to actual names.
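A speculative sketch of that multi-pass flow with the google-genai SDK; the model name, the upload call, and the prompts are assumptions to check against the current docs rather than a known-good recipe:

```python
# Speculative sketch of the three-pass flow with the google-genai SDK.
# Long videos may need a wait until the uploaded file finishes processing.
from google import genai

client = genai.Client()  # expects GEMINI_API_KEY in the environment
video = client.files.upload(file="council_meeting.mp4")

def ask(prompt: str) -> str:
    return client.models.generate_content(
        model="gemini-2.5-pro",  # substitute the current model name
        contents=[video, prompt],
    ).text

# Pass 1: audio only - assign a stable ID to each distinct voice.
voices = ask("Listen to the audio only. Identify each distinct voice, give it a "
             "stable ID (VOICE_1, VOICE_2, ...), and note how each one sounds.")

# Pass 2: video - look for name plates or on-screen overlays.
names = ask("Watch the video. List any names visible on name plates or on-screen "
            "overlays, with rough timecodes for when each person speaks.")

# Pass 3: final transcription, mapping voice IDs to real names where possible.
transcript = ask(
    "Produce a full transcript with timecodes. Use these voice IDs and name "
    "clues to label each speaker by real name where you can; otherwise keep "
    "the voice ID.\n\nVOICE NOTES:\n" + voices + "\n\nNAME CLUES:\n" + names
)
print(transcript)
```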