simonw•41m ago
The audio transcript exercise here is particularly interesting from a journalism perspective.
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
    Output a Markdown transcript of this meeting. Include speaker
    names and timestamps. Start with an outline of the key
    meeting sections, each with a title and summary and timestamp
    and list of participating names. Note in bold if anyone
    raised their voices, interrupted each other or had
    disagreements. Then follow with the full transcript.
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...
I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match to what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what about if there was a key point that Gemini 3 omitted entirely?
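(For anyone who wants to try reproducing this: a minimal sketch using the google-genai Python SDK. The model ID and filename are assumptions; swap in whichever Gemini 3 identifier your account exposes.)

    # Minimal reproduction sketch using the google-genai SDK
    # (pip install google-genai; assumes GEMINI_API_KEY is set).
    # The model ID below is an assumption, not a confirmed name.
    from google import genai

    client = genai.Client()

    # Large audio files go through the Files API rather than inline.
    audio = client.files.upload(file="council_meeting.mp3")

    prompt = (
        "Output a Markdown transcript of this meeting. Include speaker "
        "names and timestamps. Start with an outline of the key meeting "
        "sections, each with a title and summary and timestamp and list "
        "of participating names. Note in bold if anyone raised their "
        "voices, interrupted each other or had disagreements. Then "
        "follow with the full transcript."
    )

    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed model ID
        contents=[prompt, audio],
    )
    print(response.text)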
WesleyLivesay•27m ago
I think it appears to have done a good job of summarizing the points that it did summarize, at least judging from my quick watch of a few sections and from the YT transcript (which seems quite accurate).
Almost makes me wonder if, behind the scenes, it is doing something like: rough transcript -> summaries -> transcript with timecodes (runs out of context) -> throws the timestamps it does have onto the summaries.
I would be very curious to see if it does better on something like an hour long chunk of audio, to see if it is just some sort of context issue. Or if this same audio was fed to it in say 45 minute chunks to see if the timestamps fix themselves.
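A rough sketch of that chunking experiment with pydub; the 45-minute chunk size is arbitrary and the per-chunk model call is left out:

    # Sketch of the 45-minute-chunk experiment with pydub
    # (pip install pydub; needs ffmpeg on the PATH).
    from pydub import AudioSegment

    CHUNK_MS = 45 * 60 * 1000  # 45 minutes, an arbitrary choice

    audio = AudioSegment.from_file("council_meeting.mp3")
    for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
        chunk = audio[start:start + CHUNK_MS]
        chunk.export(f"chunk_{i:02d}.mp3", format="mp3")
        # Transcribe each chunk separately (model call not shown), then
        # add `start` ms back onto every returned timestamp so the
        # chunks line up against the full 3.5 hour recording.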
byt3bl33d3r•25m ago
I’ve been meaning to create & publish a structured extraction benchmark for a while. Using LLMs to extract info/entities/connections from large amounts of unstructured data is a huge boon to AI-assisted reporting and also has a number of cybersecurity applications. Gemini 2.5 was pretty good, but so far I have yet to see an LLM that can reliably, accurately, and consistently do this.
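A sketch of the kind of schema-constrained extraction I mean, using the google-genai SDK's structured output support; the schema fields are purely illustrative and the model ID is an assumption:

    # Sketch of schema-constrained entity extraction via google-genai.
    # Field names are illustrative only; the model ID is an assumption.
    from pydantic import BaseModel
    from google import genai


    class Entity(BaseModel):
        name: str
        kind: str  # e.g. "person", "org", "domain", "ip"
        connected_to: list[str]


    class Extraction(BaseModel):
        entities: list[Entity]


    client = genai.Client()
    document_text = open("report.txt").read()

    response = client.models.generate_content(
        model="gemini-2.5-pro",  # assumed model ID
        contents=["Extract every entity and its connections:", document_text],
        config={
            "response_mime_type": "application/json",
            "response_schema": Extraction,
        },
    )
    result = response.parsed  # a validated Extraction instance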
simonw•6m ago
This would be extremely useful. I think this is one of the most commercially valuable uses of these kinds of models; having more solid independent benchmarks would be great.
mistercheph•22m ago
For this use case I think the best bet is still a toolchain with a transcription model like Whisper fed into an LLM to summarize.
simonw•20m ago
Yeah I agree. I ran Whisper (via MacWhisper) on the same video and got back accurate timestamps.
The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.
The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
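A rough sketch of that hybrid, using the open-source whisper package for the timestamped half; the step that merges in Gemini's speaker labels is hand-waved:

    # Sketch of the hybrid: openai-whisper for reliable timestamps,
    # with Gemini's speaker/tone labels merged in afterwards.
    # pip install openai-whisper
    import whisper

    model = whisper.load_model("medium")
    result = model.transcribe("council_meeting.mp3")

    for seg in result["segments"]:
        # seg["start"]/seg["end"] are second offsets that track the audio.
        print(f'[{seg["start"]:8.1f}s] {seg["text"].strip()}')
        # Not shown: fuzzy-match each segment against Gemini's
        # speaker-labelled transcript to attach names and tone notes.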
rahimnathwani•21m ago
For this use case, why not use Whisper to transcribe the audio, and then an LLM to do a second step (summarization or answering questions or whatever)?
If you need diarization, you can use something like https://github.com/m-bain/whisperX
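Roughly the flow from the whisperX README (the API has moved around between versions, so treat this as a sketch; diarization needs a Hugging Face token for the pyannote models):

    # Condensed from the whisperX README; API details may have shifted
    # between versions. Diarization requires a Hugging Face token.
    import whisperx

    device = "cuda"
    audio = whisperx.load_audio("council_meeting.mp3")

    model = whisperx.load_model("large-v2", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)

    # Assign speaker labels to the transcribed segments.
    diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    for seg in result["segments"]:
        print(seg.get("speaker", "UNKNOWN"), seg["text"])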
pants2•8m ago
Whisper simply isn't very good compared to LLM audio transcription like gpt-4o-transcribe. If Gemini 3 is even better, it's a game-changer.
ks2048•16m ago
Does anyone benchmark these models for speech-to-text using traditional word error rates? It seems audio-input Gemini is a lot cheaper than Google Speech-to-text.
"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.
Workaccount2•15m ago
My assumption is that Gemini has no insight into the timestamps, and is instead ballparking them based on how much context has been analyzed up to that point.
I wonder whether it would be able to timestamp correctly if you put the audio into a video that is nothing but a black screen with a timer running.
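That's easy enough to test with ffmpeg: burn a running clock onto a black frame and mux in the meeting audio. A sketch (untested; drawtext escaping is finicky and needs an ffmpeg build with libfreetype):

    # Untested sketch: black video with a burned-in HH:MM:SS clock,
    # muxed with the meeting audio, so the model can "see" wall time.
    import subprocess

    subprocess.run([
        "ffmpeg",
        "-f", "lavfi", "-i", "color=c=black:s=640x360",   # blank video source
        "-i", "council_meeting.mp3",                      # the meeting audio
        "-vf", r"drawtext=text='%{pts\:hms}':fontcolor=white"
               r":fontsize=48:x=(w-text_w)/2:y=(h-text_h)/2",
        "-shortest",                                      # stop with the audio
        "timer_meeting.mp4",
    ], check=True)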
londons_explore•19m ago
Anyone got a class full of students and able to get a human version of this pelican benchmark?
Perhaps half with a web browser to view the results, and half working blind with the numbers alone?