And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.
Thank you Simon!
Because of him, I installed an RSS reader so that I don't miss any of his posts. And I know that he shares the same ones across Twitter, Mastodon & Bsky...
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.
Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/
Most humans do not have perfect drawing skills and perfect knowledge about bikes and birds, so they do not output such a simple drawing correctly 100% of the time.
"Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence — the modal human has just a handful of things they're good at, and one of those is the language they use, another is their day job.
Most of us can't draw, and demonstrably can't remember (or figure out from first principles) how a bike works. But this also applies to "smart" subsets of the population: physicists have https://xkcd.com/793/, and there's that famous rocket scientist who weighed in on rescuing kids from a flooded cave, coming up with some nonsense about a submarine.
Ask 100 random people to draw a bike in 10 minutes and they'll on average suck while still beating the LLMs here. Give 'em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.
The cost and speed advantage of LLMs is real as long as you're fine with extremely low quality. Ask a model for 10,000 drawings so you can pick the best and you get marginal improvements based on random chance at a steep price.
Y'see, this is a prime example of what I meant with '"Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence'.
An expert artist can spend 10 minutes and end up with a brief sketch of a bike. You can witness this exact duration yourself (with non-bike examples) because of a challenge a few years back to draw the same picture in 10 minutes, 1 minute, and 10 seconds.
A normal person spending as much time as they like gets you the pictures that I linked to in the previous post, because they don't really know what a bike is. 45 examples of what normal people think a bike looks like: https://www.gianlucagimini.it/portfolio-item/velocipedia/
> Give em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.
Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.
> Ask a model for 10,000 drawings so you can pick the best and you get a marginal improvements based on random chance at a steep price.
If you do so as a human, rating and comparing images? Then the cost is your own time.
If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get ELO ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.
A normal person given the ability to consult a picture of a bike while drawing will do much better. An LLM agent can effectively refresh its memory (or attempt to look up information on the Internet) any time it wants.
Some models can when allowed to, but I don't believe Simon Willison was testing that?
> ""Average human" is a much lower bar than most people want to believe
I have some basis for comparison. I've seen 6 year olds draw better bikes than those LLMs.
Look through that list again: the worst example doesn't even have wheels, and multiple of them have wheels that aren't connected to anything.
Now if you're arguing the average human is worse than the average 6 year old, I'm going to disagree here.
> Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.
Art lessons don't cumulatively spend 10 months teaching people how to draw a bike. I don't think I cumulatively spent 6 months drawing anything. Painting, collage, sculpture, coloring, etc.: art covers a lot, and it wasn't an every-day or even every-year thing. My mandatory college class was art history; we didn't create any art.
You may have spent more time in class studying drawing, but that’s not some universal average.
> If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get ELO ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.
Not every one of those images had a price tag, but one was 88 cents; × 10,000 that's $8,800 just to make the images for a test. Even at 4¢/image you're looking at $400. Cheaper models existed but fairly consistently had worse performance.
Also, when you're talking about how cheap something is, including the price makes sense. I had no idea what many of those models cost.
That link seeds it with 11 input tokens and 1200 output tokens - 11 input tokens is what most models use for "Generate an SVG of a pelican riding a bicycle" and 1200 is the number of output tokens used for some of the larger outputs.
Click on different models to see estimated prices. They range from 0.0168 cents for Amazon Nova Micro (that's less than 2/100ths of a cent) up to 72 cents for o1-pro.
The most expensive model most people would consider is Claude 4 Opus, at 9 cents.
GPT-4o is the upper end of the most common prices, at 1.2 cents.
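The arithmetic behind those estimates is easy to check by hand: tokens × per-token price. A minimal sketch, using per-million-token prices that reproduce the figures above (treat them as snapshots; vendor price lists drift):

```
# Per-million-token prices in USD that reproduce the estimates above;
# snapshots only - check the current price lists before relying on them.
PRICES = {
    "amazon-nova-micro": {"in": 0.035, "out": 0.14},
    "gpt-4o":            {"in": 2.50,  "out": 10.00},
}

def cost_usd(model, input_tokens=11, output_tokens=1200):
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

for m in PRICES:
    print(f"{m}: {cost_usd(m) * 100:.4f} cents")
# amazon-nova-micro: 0.0168 cents
# gpt-4o: 1.2028 cents
```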
What kind of humans are you surrounded by?
Ask any human to write 3 sentences about a specific topic. Then ask them the same exact question next day. They will not write the same 3 sentences.
Call it wikipediaslop.org
I actually don't think I've seen a single correct svg drawing for that prompt.
I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.
(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)
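A rough sketch of that panel vote, where each judge is a hypothetical wrapper around a different vision model returning "A" or "B":

```
from collections import Counter

def panel_vote(judges, image_a, image_b):
    # Each judge callable returns "A" or "B" for the image it prefers.
    votes = [judge(image_a, image_b) for judge in judges]
    tally = Counter(votes)
    winner, count = tally.most_common(1)[0]
    # With three judges, a 2-1 split is the disagreement worth logging.
    return winner, count == len(judges)
```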
I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
Any concerns that open source "AI celebrity talks" like yours could be used in contexts that would allow LLM models to optimize their market share in ways that we can't imagine yet?
Your talk might influence the funding of AI startups.
#butterflyEffect
Simon, hope you are comfortable in your new role of AI Celebrity.
Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...
The number of subject-verb-object combinations is near infinite. All are imaginable, but most are not plausible. A plausibility machine (LLM) will struggle with the implausible, until it can abstract well.
Until there are enough unique/new subject-verb-object examples/benchmarks that the trained model actually generalizes, just like you did. (Public) benchmarks need to constantly evolve, otherwise they stop being useful.
clarification: I enjoyed the pelican on a bike and don't think it's that bad =p
Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.
people expect LLMs to say "correct" stuff on the first attempt, not 10000 attempts.
Yet, these people are perfectly OK with cherry-picked success stories on YouTube + advertisements, while being extremely vehement about this simple experiment...
...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?
obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...
In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
And there is no reason that these models need to be non-deterministic.
So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".
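For what it's worth, the APIs already expose knobs for this; e.g. with the OpenAI Python SDK (a sketch, and note `seed` is documented as best-effort reproducibility, not a guarantee):

```
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Generate an SVG of a pelican riding a bicycle"}],
    temperature=0,  # (near-)greedy decoding
    seed=42,        # best-effort determinism across calls
)
print(resp.choices[0].message.content)
```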
I get it was way easier to do and that doing it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.
Other ways:
* wisdom of the crowds (have people vote on it)
* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)
* wisdom of the LLMs (use more than one LLM)
Would have been neat to see what the human consensus was and if it differed from the LLM consensus
Anyway, great talk!
https://www.google.com/search?q=pelican&udm=2
The "closest pelican" is not even close.
It's one of my favorite local models right now, I'm not sure how I missed it when I was reviewing my highlights of the last six months.
https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...
Thanks for sharing.
> Claude 4 will rat you out to the feds!
>If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it’ll rat you out.
> But it’s not just Claude. Theo Browne put together a new benchmark called SnitchBench, inspired by the Claude 4 System Card.
> It turns out nearly all of the models do the same thing.
Someone commissioned to draw a bicycle on Fiverr would not have to rely on memory of what it should look like. It would take barely any time to just look up a reference.
Say what you want about Facebook but at least they released their flagship model fully open.
Besides, it's so heavily context-dependent that you really need your own private benchmarks to make head or tails out of this whole thing.
The prompt is "Generate an SVG of a pelican riding a bicycle" and you're supposed to write it by hand, so no graphical editor. The specification is here: https://www.w3.org/TR/SVG2/
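For a sense of what "by hand" means here, a minimal illustrative example of the kind of markup involved (nowhere near a good pelican, just the raw ingredients):

```
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <!-- wheels -->
  <circle cx="55" cy="90" r="22" fill="none" stroke="black"/>
  <circle cx="145" cy="90" r="22" fill="none" stroke="black"/>
  <!-- frame -->
  <path d="M55 90 L95 60 L145 90 M95 60 L95 88" stroke="black" fill="none"/>
  <!-- body and beak -->
  <ellipse cx="95" cy="42" rx="18" ry="12" fill="white" stroke="black"/>
  <path d="M112 38 l26 6 l-26 6 z" fill="orange" stroke="black"/>
</svg>
```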
I'm fairly certain I'd lose interest in getting it right before I got something better than most of those.
There are 31 posts listed under "pelican-riding-a-bicycle" in case you wanna inspect the methodology even closer: https://simonwillison.net/tags/pelican-riding-a-bicycle/
Generate an SVG of a pelican riding a bicycle
And execute it via the model's API with all default settings, not via their user-facing interface. Currently none of the model APIs enable tools unless you ask them to, so this method excludes the use of additional tools.
The output pelican is indeed blue. I can't fathom where the idea that this is "classic", or suitable for a pelican, could have come from.
It certainly would, and it would cost at minimum an hour of the human programmer's time at $50+/hr. Claude does it in seconds for pennies.
Awkwardly, I never heard of it until now. I was aware that at some point they added the ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline stable diffusion app on my phone, so it felt like less of an upgrade to me personally). With so much AI news each week, it feels like unless you're really invested in the space, it's almost impossible not to accidentally miss or dismiss some big release.
It really is incredible.
none of these ephemeral fads are any indication of quality, longevity, legitimacy, interest, substance, endurance, prestige, relevance, credibility, allure, staying-power, refinement, or depth.
That many signups is impressive no matter what. The attempts to downplay every aspect of LLM popularity are getting really tiresome.
Doesn’t it?
100M people signed up and did at least 1 task. Then, most likely some % of them discovered it was a useful thing (if for nothing else than just to make more memes), and converted into a MAU.
If I had to use my intuition, I would say it's 5% - 10%, which represents a larger product launch than most developers will ever participate in, in the context of a single day.
Of course the ongoing stickiness of the MAU also depends on the ability of this particular tool to stay on top amongst increasing competition.
Uber at a 10x scale.
I should add that compared to the hype, at a global level Uber is a failure. Yes, it's still a big company; yes, it's profitable now. But it was launched 10+ years ago, it's only now becoming net profitable over its existence, and it shows no signs of taking over the world. Sure, it's big in the US and a few specific markets. But elsewhere it's either banned for undermining labor practices, or has stiff local competition, or it's just not cost competitive and won't enter the market, because without the whole "gig economy" scam it's just a regular taxi company with a better app.
https://www.wheresyoured.at/wheres-the-money/
https://www.wheresyoured.at/openai-is-a-systemic-risk-to-the...
A very solid argument is like that against propaganda: it's not so much about what is being said but about what isn't. OpenAI is basically shouting about every minor achievement from the rooftops, so the fact that they are remarkably silent about financial fundamentals says something. At best something mediocre, more likely bad.
Basically, it's one of those things you may read and find that, all things considered, you don't agree with the conclusions, but there's real substance there and you'll probably benefit from reading a few of his articles.
Source? They did exactly that.
Ape NFTs are… ape NFTs. Useless. Pointless. Negative value for most people.
This is deja vu, except instead of ChatGPT to edit photos it was instagram a decade ago.
Reproducing a certain style of image has been a regular fad since profile pictures became a thing sometime last century.
I was not meaning to suggest that large language & diffusion models are fads.
(I do think their capabilities are poorly understood and/or over-estimated by non-technical and some technical people alike, but that invites a more nuanced discussion.)
While I'm sure your wife is getting good value out of the system, whether it's a better fit for purpose, produces a better quality, or provides a more satisfying workflow -- than say a decent free photo editor -- or whether other tools were tried but determined to be too limited or difficult, etc -- only you or her could say. It does feel like a small sample set, though.
We're talking about Hitler memes instead? I don't understand your feigned outrage.
The actual valid commercial use case for generative images hasn't been found yet. (No, making blog spam prettier is not a good use case.)
I think it's broken out into mainstream adoption and is going to stay there.
It reminds me a little of Napster. The Napster UI was terrible, but it let people do something they had never been able to do before: listen to any piece of music ever released, on-demand. As a result people with almost no interest in technology at all were learning how to use it.
Most people have never had the ability to turn a photo of their kids into a cute cartoon before, and it turns out that's something they really want to be able to do.
Again, I was aware that they added image generation, just not how much of a deal it turned out to be. Think of it like me occasionally noticing merchandise and TV trailers for a new movie without realizing it became the new worldwide box office #1.
This measure of LLM capability could be extended by taking it into the 3D domain.
That is, having the model write Python code for Blender, then running blender in headless mode behind an API.
The talk hints at this, but one-shot prompting likely won't be a broad enough measurement of capability by this time next year. (Or perhaps even now.)
So the test could also include an agentic portion that includes consultation of the latest blender documentation or even use of a search engine for blog entries detailing syntax and technique.
For multimodal input processing, it could take into account a particular photo of a pelican as the test subject.
For usability, the objects can be converted to iOS’s native 3d format that can be viewed in mobile safari.
I built this workflow, including a service for Blender, as an initial test of what was possible in October of 2022. It took post-processing for common syntax errors back then, but I'd imagine the newer LLMs would make those mistakes less often now.
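For the curious, the Blender side of that can be very small. A rough sketch, assuming it's saved as gen_pelican.py and run with `blender --background --python gen_pelican.py` (a real pipeline would emit far more geometry and then convert the export to USDZ for iOS Quick Look):

```
import bpy

# Start from an empty scene so leftover default objects don't pollute the export.
bpy.ops.wm.read_factory_settings(use_empty=True)

# Crude stand-ins for the shapes an LLM might emit: a body and one wheel.
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, location=(0, 0, 2))  # pelican body
bpy.ops.mesh.primitive_torus_add(major_radius=1.0, minor_radius=0.1,
                                 location=(0, 0, 0.5))                # a wheel

# Export to glTF; USDZ conversion for mobile Safari is a separate step.
bpy.ops.export_scene.gltf(filepath="/tmp/pelican.glb")
```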
But bicycles are famously hard for artists as well. Cyclists can identify all of the parts, but if you don't ride a lot it can be surprisingly difficult to get all of the major bits of geometry right.
Here is a better example of a start: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfTfAA...
I did that to my daughter when she was not even 6 years old. The results were somehow similar: https://photos.app.goo.gl/XSLnTEUkmtW2n7cX8
(Now she's much better, but prefers raster tools, e.g. https://www.deviantart.com/sofiac9/art/Ivy-with-riding-gear-...)
The top results (click on the top Solutions) were pretty impressive: https://www.kaggle.com/competitions/drawing-with-llms/leader...
"Say I have a wolf, a goat, and some cabbage, and I want to get them across a river. The wolf will eat the goat if they're left alone, which is bad. The goat will eat some cabbage, and will starve otherwise. How do I get them all across the river in the fewest trips?"
A child would pick up that you have plenty of cabbage, but can't leave the goat without it, lest it starve. Also, there's no mention of boat capacity, so you could just bring them all over at once. Useful? Sometimes. Intelligent? No.
> As a power user of these tools, I want to stay in complete control of what the inputs are. Features like ChatGPT memory are taking that control away from me.
You reap what you sow....
> I already have a tool I built called shot-scraper, a CLI app that lets me take screenshots of web pages and save them as images. I had Claude build me a web page that accepts ?left= and ?right= parameters pointing to image URLs and then embeds them side-by-side on a page. Then I could take screenshots of those two images side-by-side. I generated one of those for every possible match-up of my 34 pelican pictures—560 matches in total.
Surely it would have been easier to use a local tool like ImageMagick? You could even have the AI write a Bash script for you.
> ... but prompt injection is still a thing.
...Why wouldn't it always be? There's no quoting or escaping mechanism that's actually out-of-band.
> There’s this thing I’m calling the lethal trifecta, which is when you have an AI system that has access to private data, and potential exposure to malicious instructions—so other people can trick it into doing things... and there’s a mechanism to exfiltrate stuff.
People in 2025 actually need to be told this. Franklin missed the mark - people today will trip over themselves to give up both their security and their liberty for mere convenience.
And honestly, even with LLM assistance getting Image Magick to output a 1200x600 image with two SVGs next to each other that are correctly resized to fill their half of the image sounds pretty tricky. Probably easier (for Claude) to achieve with HTML and CSS.
FWIW, the next project I want to look at after my current two, is a command-line tool to make this sort of thing easier. Likely featuring some sort of Lisp-like DSL to describe what to do with the input images.
Isn't it Δ∇Λ welded together? The bottom left and right vertices are where the wheels are attached to, the middle bottom point is where the big gear with the pedals is. The lambda is for the front wheel because you wouldn't be able to turn it if it was attached to a delta. Right?
I guess having my first bicycle be a cheap Soviet-era produced one paid off: I spent loads of time fidgeting with the chain tension, and pulling the chain back onto the gears, so I guess I had to stare at the frame way too much to forget even by today the way it looks.
https://www.gianlucagimini.it/portfolio-item/velocipedia/
> back in 2009 I began pestering friends and random strangers. I would walk up to them with a pen and a sheet of paper asking that they immediately draw me a men’s bicycle, by heart. Soon I found out that when confronted with this odd request most people have a very hard time remembering exactly how a bike is made.
I had heard of prompt injection already. But this seems different, completely out of humans' control. Even when you consider web search functionality, he is actually right: more and more, users are losing control over context.
Is this dangerous atm? Do you think it will become more dangerous in the future when we chuck even more data into context?
The issue is that LLMs have no ability to organise their memory by importance. Especially as the context size gets larger.
So when they are using tools they will become more dangerous over time.
> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.
Yeah this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.
(And I'd be envious of your impact, of course)
"The word "strawberry" contains 2 letter r’s."
strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said three
strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said four
stawberrry -> DeepSeek, GeminiPro all correctly said three
ChatGPT4o, even in a new chat, incorrectly said the word "stawberrry" contains 4 letter "r" characters. It even provided this useful breakdown to let me know :-)
Breakdown: stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's
And then asked if I meant "strawberry" instead and said because that one has 2 r's....
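The counts themselves are trivially checkable outside the model:

```
for word in ("strawberry", "strawberrry", "stawberrry"):
    print(word, word.count("r"))
# strawberry 3
# strawberrry 4
# stawberrry 3
```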
x8 version: still shit... x15 version: we are getting closer, but overall a shit experience :D
this way they won't know what to improve upon. of course they can buy access. ;P
when they finally solve your problem you can reveal what was the benchmark.
```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0,      duration: 'triplet' }, // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0,      duration: 'triplet' }, // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
]
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.
I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).
https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m
I know there are far better ways to do gen AI with music, this was just a joke prompt that worked far better than I expected.
My naive guess is all of the guitar tabs and signal processing info it's trained on gives it the ability to do stuff like this (albeit not very well).
Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.
Side note: I'd really like to see the Language Benchmark Game become a prompt based languages * models benchmark game. So we could say model X excels at Python Fasta, etc. although then the risk is that, again, it becomes training set and the whole thing self-rigs itself.
I also can't help but notice that the competition is exactly one match short; for some reason exactly one of the 561 possible pairings has not been included.
The missing match is because one single round was declared a draw by the model, and I didn't have time to run it again (the Elo stuff was very much rushed at the last minute.)
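(For anyone wanting to reproduce it, the Elo bookkeeping per match is only a few lines; a minimal sketch, with K=32 as a conventional choice rather than necessarily what was used here:)

```
def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a draw.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```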
OP seems to ignore that a pelican has a distinct look when evaluating these doodles.
You didn't even mention the beak or the lack of similarities in your blog.
Your text is centered around this rather peculiar statement:
> Most importantly: pelicans can’t ride bicycles.
"Pelicans can't ride bicycles" is a good joke.
GIGO in motion :-)
No... that is an attempt at actually drawing the pedals, and putting the pelican's feet right on the pedals!
neepi•8mo ago
dist-epoch•8mo ago
neepi•8mo ago
I would not hire a blind artist or a deaf musician.
dist-epoch•8mo ago
Like asking you to draw a 2D projection of a 4D sphere intersected with a 4D torus or something.
kevindamm•8mo ago
namibj•8mo ago
Most of the non-math design work of applied engineering AFAIK falls under the umbrella that's tested with the pelican riding the bicycle. You have to make a mental model and then turn it into applicable instructions.
Program code/SVG markup/parametric CAD instructions don't really differ in that aspect.
neepi•8mo ago
Ergo no, you can't just throw a bicycle into an LLM and have a parametric model drop out into SolidWorks, then a machine makes it and everyone buys it. That is the hope, really, isn't it? You end up with a useless shitty bike with a shit pelican on it.
The biggest problem we have in the LLM space is that no one really understands any of the proposed use cases well enough, and neither does anyone being told that it works for those use cases.
dist-epoch•8mo ago
neepi•8mo ago
rjsw•8mo ago
neepi•8mo ago
__alexs•8mo ago
dmd•8mo ago
You too, Monet. Scram.
simonw•8mo ago
It's a fun way to deflate the hype. Sure, your new LLM may have cost XX million to train and beat all the others on the benchmarks, but when you ask it to draw a pelican on a bicycle it still outputs total junk.
dist-epoch•8mo ago
https://chatgpt.com/share/684582a0-03cc-8006-b5b5-de51e5cd89...
lol: https://gemini.google.com/share/4d1746a234a8
wongogue•8mo ago
matkoniecz•8mo ago
neepi•8mo ago
ben_w•8mo ago
My CV had a stupid cliché, "committed to quality", which they correctly picked up on — "What do you mean?" one of them asked me, directly.
I thought this meant I was focussed on being the best. He didn't like this answer.
His example, blurred by 20 years of my imperfect human memory, was to ask me which is better: a Porsche, or a go-kart. Now, obviously (or I wouldn't be saying this), Porsche was a trick answer. Less obviously, both were trick answers, because their point was that the question was under-specified — quality is the match between the product and what the user actually wants. So if the user is a 10 year old who physically isn't big enough to sit in a real car's driver's seat and just wants to rush down a hill or along a track, none of the "quality" stuff that makes a Porsche a Porsche is of any relevance at all, but what does matter is the stuff that makes a go-kart into a go-kart… one of which is the affordability.
LLMs are go-karts of the mind. Sometimes that's all you need.
neepi•8mo ago
Go-kart or Porsche is irrelevant.
ben_w•8mo ago
That's the point.
The market for go-karts does not support Porsche.
If you bring a Porsche sales team to a go-kart race, nobody will be interested.
Porsche doesn't care about this market. It goes both ways: this market doesn't care about Porsche, either.
keiferski•8mo ago
Prompting for a pelican riding a bicycle makes a decent image there.
keiferski•8mo ago
GaggiX•8mo ago
jug•8mo ago
Result: https://www.dropbox.com/scl/fi/8b03yu5v58w0o5he1zayh/pelican...
These are tough benchmarks to trial reasoning by having it _write_ an SVG file by hand and understanding how it's to be written to achieve this. Even a professional would struggle with that! It's _not_ a benchmark to give an AI the best tools to actually do this.
YuccaGloriosa•8mo ago
sethaurus•8mo ago
spaceman_2020•8mo ago
vunderba•8mo ago
A similar test would be if you asked for the pelican on a bicycle through a series of LOGO instructions.