Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark

https://simonwillison.net/2025/Nov/18/gemini-3/
31•nabla9•1h ago

Comments

simonw•41m ago
The audio transcript exercise here is particularly interesting from a journalism perspective.

Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.

I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):

  Output a Markdown transcript of this meeting. Include speaker
  names and timestamps. Start with an outline of the key
  meeting sections, each with a title and summary and timestamp
  and list of participating names. Note in bold if anyone
  raised their voices, interrupted each other or had
  disagreements. Then follow with the full transcript.

Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...

I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.

It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match to what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what about if there was a key point that Gemini 3 omitted entirely?

WesleyLivesay•27m ago
I think it appears to have done a good job of summarizing the points that it did summarize, at least judging from my quick watch of a few sections and from the YT transcript (which seems quite accurate).

Almost makes me wonder if it is behind the scenes doing something similar to: rough transcript -> Summaries -> transcript with timecodes (runs out of context) -> throws timestamps that it has on summaries.

I would be very curious to see if it does better on something like an hour long chunk of audio, to see if it is just some sort of context issue. Or if this same audio was fed to it in say 45 minute chunks to see if the timestamps fix themselves.
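To make the chunking experiment concrete, here is a minimal sketch of the bookkeeping it would need: split the recording into fixed-length chunks, then shift each chunk-relative timestamp back onto the full recording's clock. The function names and the 45 minute default are assumptions for illustration; actually cutting the audio and calling the model is left out.

```python
from datetime import timedelta


def chunk_bounds(total_seconds: float, chunk_seconds: float = 45 * 60):
    """Return (start, end) offsets in seconds that cover the whole recording."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def to_absolute(chunk_start: float, relative_seconds: float) -> str:
    """Shift a chunk-relative timestamp onto the full recording's clock."""
    return str(timedelta(seconds=int(chunk_start + relative_seconds)))


# A 3.5 hour meeting split into 45 minute chunks (the last chunk is shorter).
bounds = chunk_bounds(3.5 * 3600)
for start, end in bounds:
    print(to_absolute(start, 0), "to", to_absolute(end, 0))
```

If the timestamps inside each chunk come back accurate, that would point to a context-length issue rather than the model having no sense of time at all.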

byt3bl33d3r•25m ago
I’ve been meaning to create & publish a structured extraction benchmark for a while. Using LLMs to extract info/entities/connections from large amounts of unstructured data is also a huge boon to AI-assisted reporting and also has a number of cybersecurity applications. Gemini 2.5 was pretty good, but so far I have yet to see an LLM that can do this reliably, accurately and consistently.
simonw•6m ago
This would be extremely useful. I think this is one of the most commercially valuable uses of these kinds of models, so having more solid independent benchmarks would be great.
mistercheph•22m ago
For this use case I think the best bet is still a toolchain with a transcription model like Whisper feeding into an LLM to summarize.
simonw•20m ago
Yeah I agree. I ran Whisper (via MacWhisper) on the same video and got back accurate timestamps.

The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.

The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
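One way the two outputs might be combined, as a rough sketch: since the LLM's timestamps can't be trusted, match on text instead of time. Each timestamped segment from the transcription model takes the speaker label of the most similar LLM utterance. The dictionary shapes, the `attach_speakers` name and the 0.5 threshold are all assumptions for illustration, not something either tool provides.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attach_speakers(timed_segments, speaker_turns, min_score=0.5):
    """Label each timestamped segment with the best-matching speaker.

    timed_segments: [{"start": float, "text": str}] from the transcription model
    speaker_turns:  [{"speaker": str, "text": str}] from the LLM transcript
    Segments with no sufficiently similar turn are labelled "UNKNOWN".
    """
    labelled = []
    for seg in timed_segments:
        scored = [(similarity(seg["text"], t["text"]), t["speaker"]) for t in speaker_turns]
        score, speaker = max(scored, default=(0.0, "UNKNOWN"))
        if score < min_score:
            speaker = "UNKNOWN"
        labelled.append({**seg, "speaker": speaker})
    return labelled


timed = [
    {"start": 12.0, "text": "I move that we approve the minutes."},
    {"start": 15.5, "text": "Second."},
]
turns = [
    {"speaker": "Chair", "text": "I move we approve the minutes"},
    {"speaker": "Member A", "text": "Second"},
]
merged = attach_speakers(timed, turns)
for seg in merged:
    print(seg["start"], seg["speaker"], seg["text"])
```

Comparing every segment against every turn is quadratic, so a real 3.5 hour meeting would probably want a windowed search. Very short utterances that repeat across speakers ("Second.", "Aye.") will also match ambiguously under this approach.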

rahimnathwani•21m ago
For this use case, why not use Whisper to transcribe the audio, and then an LLM to do a second step (summarization or answering questions or whatever)?

If you need diarization, you can use something like https://github.com/m-bain/whisperX
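A minimal sketch of the hand-off between the two steps, assuming the transcription step yields segments shaped like `{"start": seconds, "text": ...}` (the shape of the `segments` list that openai-whisper's `transcribe()` returns, as far as I know). The helper names are made up, and sending the prompt to an LLM is left out.

```python
def fmt_ts(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"


def build_prompt(segments, instruction: str) -> str:
    """Prefix each segment with its timestamp so the LLM can cite real times."""
    lines = [f"[{fmt_ts(seg['start'])}] {seg['text'].strip()}" for seg in segments]
    return instruction + "\n\n" + "\n".join(lines)


segments = [
    {"start": 0.0, "text": " Call to order."},
    {"start": 3725.4, "text": " Item 4 is approved."},
]
prompt = build_prompt(segments, "Summarize this meeting. Cite timestamps.")
print(prompt)
```

Because the timestamps come from the transcription step, the LLM only has to copy them rather than estimate them.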

pants2•8m ago
Whisper simply isn't very good compared to LLM audio transcription like gpt-4o-transcribe. If Gemini 3 is even better it's a game-changer.
ks2048•16m ago
Does anyone benchmark these models for speech-to-text using traditional word error rates? It seems audio-input Gemini is a lot cheaper than Google Speech-to-text.
simonw•3m ago
Here's one: https://voicewriter.io/speech-recognition-leaderboard

"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.
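For anyone who wants to score a transcript themselves, word error rate is the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. A minimal sketch follows; real benchmarks usually also normalize punctuation and number formatting before comparing, which this version does not.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("passes" -> "passed") and one deletion ("to"): 2 errors over 6 words.
wer = word_error_rate("the motion passes five to two", "the motion passed five two")
print(round(wer, 3))
```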

Workaccount2•15m ago
My assumption is that Gemini has no insight into the timestamps, and instead is ballparking them based on how much context has been analyzed up to that point.

I wonder whether, if you put the audio into a video that is nothing but a black screen with a timer running, it would be able to timestamp correctly.

londons_explore•19m ago
Anyone got a class full of students and able to get a human version of this pelican benchmark?

Perhaps half with a web browser to view the results, and half working blind with the numbers alone?
