Perhaps half with a web browser to view the results, and half working blind with the numbers alone?
0: https://www.wired.com/2016/04/can-draw-bikes-memory-definite...
I imagine that you could theoretically also guess and check without the web browser by manually rendering the SVG using some graph paper, a compass, a straightedge, and coloured pencils, but that sounds unbelievably tedious and also very error-prone.
Without having an LLM figure out the required command line parameters? Mad props!
These things are getting really good at plain transcription (as long as you don't care about it being verbatim), but every additional dimension you add (timestamps, speaker assignment, etc.) makes the others worse. These tasks work much better as independent processes that are then reconciled and refined by a multimodal LLM.
> Truth be told, I’m playing the long game here. All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
I think a more challenging, well, challenge would be to offer an even more absurd scenario and see how the model handles it.
Example: generate an svg of a pelican and a mongoose eating popcorn inside a pyramid-shaped vehicle flying around Jupiter. Result: https://imgur.com/a/TBGYChc
I was inspired by Max Woolf's nano banana test prompts: https://minimaxir.com/2025/11/nano-banana-prompts/
Do you think it would be reasonable to include both in future reviews, at least for the sake of back-compatibility (and comparability)?
Love the pivot in pelican generation bench.
I'll need to update for V2!
I noticed that Gemini 3 Pro can no longer recognize the audio files I upload. It just gives me information from old chats or random stuff that isn’t relevant. It can’t even grasp the topic of the class anymore; it just spits out nonsense.
Something has changed. Just a few days ago it worked just fine with Gemini 2.5 Pro!
It's not just me: https://www.reddit.com/r/GeminiAI/comments/1p0givt/gemini_25...
But I'm seeing this with Gemini 3 Pro on the web (Pro subscription); AI Studio is still working fine.
simonw•2mo ago
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...
I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the fact that the transcript isn't close to an exact match to what was said - and the timestamps are incorrect - means it's very hard to trust the output. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they occurred... but what about if there was a key point that Gemini 3 omitted entirely?
WesleyLivesay•2mo ago
Almost makes me wonder if, behind the scenes, it's doing something similar to: rough transcript -> summaries -> transcript with timecodes (runs out of context) -> slaps the timestamps it does have onto the summaries.
I would be very curious to see if it does better on something like an hour long chunk of audio, to see if it is just some sort of context issue. Or if this same audio was fed to it in say 45 minute chunks to see if the timestamps fix themselves.
simonw•2mo ago
The big benefit of Gemini for this is that it appears to do a great job of speaker recognition, plus it can identify when people interrupt each other or raise their voices.
The best solution would likely include a mixture of both - Gemini for the speaker identification and tone-of-voice stuff, Whisper or NVIDIA Parakeet or similar for the transcription with timestamps.
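A minimal sketch of what that reconciliation could look like, assuming you already have a timestamped ASR transcript and a separate speaker-labelled pass; the data shapes and names here are invented for illustration:

```python
# Sketch: give each ASR segment (accurate timestamps, from Whisper/Parakeet)
# the speaker label from whichever diarized span (e.g. from Gemini) overlaps it most.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str

@dataclass
class SpeakerSpan:
    start: float
    end: float
    speaker: str  # e.g. "Mayor Bob"

def overlap(a: Segment, b: SpeakerSpan) -> float:
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def merge(asr_segments, speaker_spans):
    merged = []
    for seg in asr_segments:
        best = max(speaker_spans, key=lambda s: overlap(seg, s), default=None)
        speaker = best.speaker if best and overlap(seg, best) > 0 else "Unknown"
        merged.append((seg.start, seg.end, speaker, seg.text))
    return merged

if __name__ == "__main__":
    asr = [Segment(0.0, 4.2, "Next item is the zoning variance."),
           Segment(4.2, 9.8, "I move that we approve it.")]
    spans = [SpeakerSpan(0.0, 5.0, "Chair"), SpeakerSpan(5.0, 10.0, "Councillor A")]
    for row in merge(asr, spans):
        print(row)
```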
rahimnathwani•2mo ago
If you need diarization, you can use something like https://github.com/m-bain/whisperX
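For reference, the usage pattern is roughly the one from the whisperX README; exact signatures have drifted between releases (newer versions expose DiarizationPipeline under whisperx.diarize), and the diarization step needs a Hugging Face token for the pyannote models:

```python
# Roughly the whisperX README flow: transcribe, word-align, then diarize and
# attach speaker labels. Treat names and arguments as approximate.
import whisperx

device = "cuda"
audio_file = "council_meeting.mp3"

model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# Word-level alignment for better timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Diarization (requires a Hugging Face token for the pyannote models)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), round(seg["start"], 1), seg["text"])
```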
simonw•2mo ago
"Real-World Speech-to-text API Leaderboard" - it includes scores for Gemini 2.5 Pro and Flash.
Workaccount2•2mo ago
I wonder if you put the audio into a video that is nothing but a black screen with a timer running, it would be able to correctly timestamp.
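An untested sketch of how you might build such a test clip by wrapping ffmpeg from Python; the drawtext filter string and encoder flags are the parts to double-check (some ffmpeg builds also need an explicit fontfile= in drawtext):

```python
# Untested sketch: render a black video with a burned-in running clock and mux
# in the meeting audio, so the model has a visible timer to anchor timestamps.
import subprocess

audio = "council_meeting.mp3"
out = "meeting_with_timer.mp4"

subprocess.run([
    "ffmpeg", "-y",
    "-f", "lavfi", "-i", "color=c=black:s=1280x720:r=5",   # silent black canvas
    "-i", audio,
    "-vf", r"drawtext=text='%{pts\:hms}':fontsize=72:fontcolor=white:x=(w-text_w)/2:y=(h-text_h)/2",
    "-shortest",                                            # stop when the audio ends
    "-c:v", "libx264", "-c:a", "aac",
    out,
], check=True)
```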
potatolicious•2mo ago
IMO the right way to do this is to feed the audio into a transcription model, specifically one that supports diarization (separation of multiple speakers). This will give you a high quality raw transcript that is pretty much exactly what was actually said.
It would be rough in places (e.g., "Speaker 1", "Speaker 2", etc., rather than actual speaker names).
Then you want to post-process with an LLM to re-annotate the transcript and clean it up (e.g., replace "Speaker 1" with "Mayor Bob"), and to query against it.
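A rough sketch of that post-processing step, using a generic chat-completion call; the model name and prompt are placeholders for illustration, not what the commenter uses:

```python
# Sketch of the LLM cleanup pass: infer real names for "Speaker N" labels from
# context in the transcript and rewrite it. Model name is a placeholder.
from openai import OpenAI

client = OpenAI()

raw_transcript = open("diarized_transcript.txt").read()

prompt = (
    "Below is a meeting transcript with anonymous speaker labels.\n"
    "1. Infer the real name or role of each speaker from context "
    "(introductions, roll call, people addressing each other).\n"
    "2. Rewrite the transcript replacing 'Speaker N' with that name; keep "
    "'Speaker N' where you cannot tell.\n"
    "3. Start with a 'Speaker N -> name' mapping and your confidence in each.\n\n"
    + raw_transcript
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```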
I see another post here complaining that direct-to-LLM beats a transcription model like Whisper - I would challenge that. Any modern ASR model will do a very, very good job with 95%+ accuracy.
simonw•2mo ago
(Update: I just updated MacWhisper and it can now run Parakeet which appears to have decent diarization built in, screenshot here: https://static.simonwillison.net/static/2025/macwhisper-para... )
refulgentis•2mo ago
Separately, in my role as wizened 16 year old veteran of HN: it was jarring to read that. There’s a “rules” section, but don’t be turned off by the name; it’s more like a nice collection of guidelines for how to interact in a way that encourages productive discussion that illuminates. One of the key rules is not to interpret things weakly. Here, someone spelled out exactly how to do it, and we shouldn’t then assume it’s not AI, then tie it to a vague, demeaning description of “AI hype”, then ask an unanswerable question about what’s the point of “AI hype”.
To be clear, if you’re nontechnical and new to HN, it would be hard to know how to ask that a different way, I suppose.
darkwater•2mo ago
I think you misunderstood my comment. https://news.ycombinator.com/item?id=45973656 has got the right reading of it.
sillyfluke•2mo ago
LLM summarization is utterly useless when you want 100% accuracy on final binding decisions, as with council meeting decisions. My experience has been that LLMs cannot be trusted to follow convoluted discussions, including ones that revisit earlier agenda items later in the meeting.
With transcriptions, the catastrophic risk is far lower, since I'm doing the summarizing from the transcript myself. But in that case, for an auto-generated transcript, I'll take correct timestamps with gibberish-sounding sentences over incorrect timestamps with "convincing" but hallucinated sentences any day.
Any LLM summarization of a sufficiently important meeting requires second-by-second human verification against the audio recording. I have yet to see this convincingly refuted (i.e., an LLM that consistently maintains 100% accuracy when summarizing meeting decisions).
Royce-CMR•2mo ago
But not impossible. I’ve had success with prompts that identify all the topics and then map all the conversation tied to each topic (in separate LLM queries for each), then pull together a summary and conclusions by topic.
I’ve also had success with one-shot prompts, especially with the right context about the event and the phrasing shared up front. But honestly I end up spending about 5-10 minutes reviewing and cleaning up the output before it’s solid.
But that’s worlds better than attending the event and then manually pulling together notes from your fast in-flight shorthand.
(Former BA, ran JADs etc, lived and died by accuracy and right color / expression / context in notes)
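A hypothetical sketch of the topics-first workflow described above (enumerate topics, map the discussion per topic in separate LLM calls, then roll up); the model name and prompts are invented for illustration:

```python
# Sketch of the topics-first approach: enumerate agenda topics, gather the
# discussion and decision for each in its own LLM call, then roll everything
# up into one summary.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

transcript = open("meeting_transcript.txt").read()

topics = [t.strip() for t in ask(
    "List every distinct agenda topic discussed in this transcript, one per "
    "line, including topics that get revisited later in the meeting:\n\n"
    + transcript
).splitlines() if t.strip()]

per_topic = {
    topic: ask(
        f"Collect everything said about the topic '{topic}' in the transcript "
        "below, in order, and state any final decision or vote on it:\n\n"
        + transcript
    )
    for topic in topics
}

summary = ask(
    "Combine these per-topic notes into a concise meeting summary with a "
    "'Decisions' section:\n\n"
    + "\n\n".join(f"## {t}\n{notes}" for t, notes in per_topic.items())
)
print(summary)
```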
luke-stanley•2mo ago
Since plain decoder-only models stole the show, Google DeepMind demonstrated a way to adapt LLMs: adding a T5 encoder to an existing, normal Gemma model to get the benefits of more grounded text-to-text tasks WITHOUT instruction tuning (and its increased risk of prompt injection). They shared a few different variants on HuggingFace. I haven't got around to fine-tuning the weights of one for summarisation yet, but it could well be a good way to get more reliable summarisation. I did try out some of the models for inference, though, and made a Gist, which is useful since I found the HF default code example a bit broken:
https://gist.github.com/lukestanley/ee89758ea315b68fd66ba52c...
Google's minisite: https://deepmind.google/models/gemma/t5gemma/
Paper: https://arxiv.org/abs/2504.06225
Here is one such model on HF that didn't hallucinate and actually did summarise: https://huggingface.co/google/t5gemma-l-l-prefixlm
anilgulecha•2mo ago
The cutting step is simple, and the token count is pretty much the same, but the crucial additional detail allows for excellent transcription fidelity, time-wise.
We've also experimented with passing in a regular speech-to-text (non-LLM) transcript for reference, which again helps the LLM do better.
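If the "cutting step" means chunking the audio and telling the model each chunk's absolute offset, a sketch with pydub might look like this; the chunk length and prompt wording are guesses, not what the commenter actually uses:

```python
# Sketch: cut the audio into fixed-length chunks and tell the model each
# chunk's absolute offset, so per-chunk timestamps map back onto the full
# recording.
from pydub import AudioSegment

CHUNK_MINUTES = 45

audio = AudioSegment.from_file("council_meeting.mp3")
chunk_ms = CHUNK_MINUTES * 60 * 1000

for i, start_ms in enumerate(range(0, len(audio), chunk_ms)):
    chunk = audio[start_ms:start_ms + chunk_ms]
    path = f"chunk_{i:02d}.mp3"
    chunk.export(path, format="mp3")

    offset_min = start_ms // 60_000
    prompt = (
        f"This audio begins {offset_min} minutes into a longer meeting. "
        "Transcribe it with speaker labels, reporting every timestamp as an "
        "absolute offset from the start of the full meeting."
    )
    # ...upload `path` with `prompt` to the multimodal LLM of your choice here.
```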
theshrike79•2mo ago
It goes through the video analysis process and terminology step by step. You need to say "timecode" instead of "timestamp" so the LLM aligns more with AV terminology than with programming (weird, right?).
Basically you need two to three passes to get a proper transcription (a rough code sketch follows the list):

1. Just listen and identify voices, giving each some kind of ID.
2. Maybe watch the video and see if the people have name tags on the table in front of them, or whether there's an on-screen overlay (news or interviews).
3. On the last pass, go through the transcription and map voice IDs to actual names.
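A speculative sketch of that multi-pass flow with the google-genai SDK; the model name, the upload call, and the prompts are assumptions to check against the current docs rather than a known-good recipe:

```python
# Speculative sketch of the three-pass flow with the google-genai SDK.
# Long videos may need a wait until the uploaded file finishes processing.
from google import genai

client = genai.Client()  # expects GEMINI_API_KEY in the environment
video = client.files.upload(file="council_meeting.mp4")

def ask(prompt: str) -> str:
    return client.models.generate_content(
        model="gemini-2.5-pro",  # substitute the current model name
        contents=[video, prompt],
    ).text

# Pass 1: audio only - assign a stable ID to each distinct voice.
voices = ask("Listen to the audio only. Identify each distinct voice, give it a "
             "stable ID (VOICE_1, VOICE_2, ...), and note how each one sounds.")

# Pass 2: video - look for name plates or on-screen overlays.
names = ask("Watch the video. List any names visible on name plates or on-screen "
            "overlays, with rough timecodes for when each person speaks.")

# Pass 3: final transcription, mapping voice IDs to real names where possible.
transcript = ask(
    "Produce a full transcript with timecodes. Use these voice IDs and name "
    "clues to label each speaker by real name where you can; otherwise keep "
    "the voice ID.\n\nVOICE NOTES:\n" + voices + "\n\nNAME CLUES:\n" + names
)
print(transcript)
```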