And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.
Thank you Simon!
Because of him, I installed an RSS reader so that I don't miss any of his posts. And I know that he shares the same ones across Twitter, Mastodon & Bsky...
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.
Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/
Most humans do not have perfect drawing skills and perfect knowledge about bikes and birds, so they do not output such a simple drawing correctly 100% of the time.
"Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence — the modal human has just a handful of things they're good at, and one of those is the language they use, another is their day job.
Most of us can't draw, and demonstrably can't remember (or figure out from first principles) how a bike works. But this also applies to "smart" subsets of the population: physicists have https://xkcd.com/793/, and there's that famous rocket scientist who weighed in on rescuing kids from a flooded cave, coming up with some nonsense about a submarine.
Ask 100 random people to draw a bike in 10 minutes and they'll on average suck while still beating the LLMs here. Give 'em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.
The cost and speed advantage of LLMs is real as long as you're fine with extremely low quality. Ask a model for 10,000 drawings so you can pick the best and you get marginal improvements based on random chance at a steep price.
Y'see, this is a prime example of what I meant with '"Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence'.
An expert artist can spend 10 minutes and end up with a brief sketch of a bike. You can witness this exact duration yourself (with non-bike examples) because of a challenge a few years back to draw the same picture in 10 minutes, 1 minute, and 10 seconds.
A normal person spending as much time as they like gets you the pictures that I linked to in the previous post, because they don't really know what a bike is. 45 examples of what normal people think a bike looks like: https://www.gianlucagimini.it/portfolio-item/velocipedia/
> Give em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.
Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.
> Ask a model for 10,000 drawings so you can pick the best and you get a marginal improvements based on random chance at a steep price.
If you do so as a human, rating and comparing images? Then the cost is your own time.
If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get ELO ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.
A normal person given the ability to consult a picture of a bike while drawing will do much better. An LLM agent can effectively refresh its memory (or attempt to look up information on the Internet) any time it wants.
Some models can when allowed to, but I don't believe Simon Willison was testing that?
> ""Average human" is a much lower bar than most people want to believe
I have some basis for comparison. I've seen 6 year olds draw better bikes than those LLMs.
Look through that list again: the worst example doesn't even have wheels, and multiple of them have wheels that aren't connected to anything.
Now if you're arguing the average human is worse than the average 6 year old, I'm going to disagree here.
> Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.
Art lessons don't cumulatively spend 10 months teaching people how to draw a bike. I don't think I cumulatively spent 6 months drawing anything. Painting, collage, sculpture, coloring, etc.: art covers a lot, and it wasn't an every-day or even every-year thing. My mandatory college class was art history; we didn't create any art.
You may have spent more time in class studying drawing, but that’s not some universal average.
> If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get ELO ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.
Not every one of those images had a price tag, but one was 88 cents; × 10,000 that's $8,800 just to make the images for a test. Even at 4¢/image you're looking at $400. Cheaper models existed but fairly consistently had worse performance.
Also, when you're talking about how cheap something is, including the price makes sense. I had no idea what many of those models cost.
That link seeds it with 11 input tokens and 1200 output tokens - 11 input tokens is what most models use for "Generate an SVG of a pelican riding a bicycle" and 1200 is the number of output tokens used for some of the larger outputs.
Click on different models to see estimated prices. They range from 0.0168 cents for Amazon Nova Micro (that's less than 2/100ths of a cent) up to 72 cents for o1-pro.
The most expensive model most people would consider is Claude 4 Opus, at 9 cents.
GPT-4o is the upper end of the most common prices, at 1.2 cents.
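The arithmetic behind those estimates is easy to check by hand: tokens × per-token price. A minimal sketch, using per-million-token prices that reproduce the figures above (treat them as snapshots; vendor price lists drift):

```
# Per-million-token prices in USD that reproduce the estimates above;
# snapshots only - check the current price lists before relying on them.
PRICES = {
    "amazon-nova-micro": {"in": 0.035, "out": 0.14},
    "gpt-4o":            {"in": 2.50,  "out": 10.00},
}

def cost_usd(model, input_tokens=11, output_tokens=1200):
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

for m in PRICES:
    print(f"{m}: {cost_usd(m) * 100:.4f} cents")
# amazon-nova-micro: 0.0168 cents
# gpt-4o: 1.2028 cents
```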
What kind of humans are you surrounded by?
Ask any human to write 3 sentences about a specific topic. Then ask them the same exact question next day. They will not write the same 3 sentences.
Call it wikipediaslop.org
I actually don't think I've seen a single correct svg drawing for that prompt.
I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.
(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)
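A rough sketch of that panel vote, where each judge is a hypothetical wrapper around a different vision model returning "A" or "B":

```
from collections import Counter

def panel_vote(judges, image_a, image_b):
    # Each judge callable returns "A" or "B" for the image it prefers.
    votes = [judge(image_a, image_b) for judge in judges]
    tally = Counter(votes)
    winner, count = tally.most_common(1)[0]
    # With three judges, a 2-1 split is the disagreement worth logging.
    return winner, count == len(judges)
```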
I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
Any concerns that open source "AI celebrity talks" like yours could be used in contexts that would allow LLM models to optimize their market share in ways that we can't imagine yet?
Your talk might influence the funding of AI startups.
#butterflyEffect
Simon, hope you are comfortable in your new role of AI Celebrity.
Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...
The number of subject-verb-object combinations is near infinite. All are imaginable, but most are not plausible. A plausibility machine (LLM) will struggle with the implausible, until it can abstract well.
Until there are enough unique/new subject-verb-object examples/benchmarks that the trained model actually generalizes, just like you did. (Public) benchmarks need to constantly evolve, otherwise they stop being useful.
clarification: I enjoyed the pelican on a bike and don't think it's that bad =p
Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.
people expect LLMs to say "correct" stuff on the first attempt, not 10000 attempts.
Yet, these people are perfectly OK with cherry-picked success stories on YouTube + advertisements, while being extremely vehement about this simple experiment...
...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?
obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...
In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
And there is no reason that these models need to be non-deterministic.
So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".
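For what it's worth, the APIs already expose knobs for this; e.g. with the OpenAI Python SDK (a sketch, and note `seed` is documented as best-effort reproducibility, not a guarantee):

```
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Generate an SVG of a pelican riding a bicycle"}],
    temperature=0,  # (near-)greedy decoding
    seed=42,        # best-effort determinism across calls
)
print(resp.choices[0].message.content)
```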
I get it was way easier to do and that doing it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.
Other ways:
* wisdom of the crowds (have people vote on it)
* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)
* wisdom of the LLMs (use more than one LLM)
Would have been neat to see what the human consensus was and if it differed from the LLM consensus
Anyway, great talk!
https://www.google.com/search?q=pelican&udm=2
The "closest pelican" is not even close.
It's one of my favorite local models right now, I'm not sure how I missed it when I was reviewing my highlights of the last six months.
https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...
Thanks for sharing.
> Claude 4 will rat you out to the feds!
>If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it’ll rat you out.
> But it’s not just Claude. Theo Browne put together a new benchmark called SnitchBench, inspired by the Claude 4 System Card.
> It turns out nearly all of the models do the same thing.
Someone commissioned to draw a bicycle on Fiverr would not have to rely on memory of what it should look like. It would take barely any time to just look up a reference.
Say what you want about Facebook but at least they released their flagship model fully open.
Besides, it's so heavily context-dependent that you really need your own private benchmarks to make head or tails out of this whole thing.
The prompt is "Generate an SVG of a pelican riding a bicycle" and you're supposed to write it by hand, so no graphical editor. The specification is here: https://www.w3.org/TR/SVG2/
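For a sense of what "by hand" means here, a minimal illustrative example of the kind of markup involved (nowhere near a good pelican, just the raw ingredients):

```
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <!-- wheels -->
  <circle cx="55" cy="90" r="22" fill="none" stroke="black"/>
  <circle cx="145" cy="90" r="22" fill="none" stroke="black"/>
  <!-- frame -->
  <path d="M55 90 L95 60 L145 90 M95 60 L95 88" stroke="black" fill="none"/>
  <!-- body and beak -->
  <ellipse cx="95" cy="42" rx="18" ry="12" fill="white" stroke="black"/>
  <path d="M112 38 l26 6 l-26 6 z" fill="orange" stroke="black"/>
</svg>
```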
I'm fairly certain I'd lose interest in getting it right before I got something better than most of those.
There are 31 posts listed under "pelican-riding-a-bicycle" in case you wanna inspect the methodology even closer: https://simonwillison.net/tags/pelican-riding-a-bicycle/
Generate an SVG of a pelican riding a bicycle
And execute it via the model's API with all default settings, not via their user-facing interface. Currently none of the model APIs enable tools unless you ask them to, so this method excludes the use of additional tools.
The output pelican is indeed blue. I can't fathom where the idea that this is "classic", or suitable for a pelican, could have come from.
It certainly would, and it would cost at minimum an hour of the human programmer's time at $50+/hr. Claude does it in seconds for pennies.
Awkwardly, I never heard of it until now. I was aware that at some point they added the ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline stable diffusion app on my phone, so it felt like less of an upgrade to me personally). With so much AI news each week, it feels like unless you're really invested in the space, it's almost impossible not to accidentally miss or dismiss some big release.
It really is incredible.
none of these ephemeral fads are any indication of quality, longevity, legitimacy, interest, substance, endurance, prestige, relevance, credibility, allure, staying-power, refinement, or depth.
That many signups is impressive no matter what. The attempts to downplay every aspect of LLM popularity are getting really tiresome.
Doesn’t it?
100M people signed up and did at least 1 task. Then, most likely some % of them discovered it was a useful thing (if for nothing else than just to make more memes), and converted into a MAU.
If I had to use my intuition, I would say it's 5% - 10%, which represents a larger product launch than most developers will ever participate in, in the context of a single day.
Of course the ongoing stickiness of the MAU also depends on the ability of this particular tool to stay on top amongst increasing competition.
Uber at a 10x scale.
I should add that compared to the hype, at a global level Uber is a failure. Yes, it's still a big company; yes, it's profitable now. But it was launched 10+ years ago, it's only now becoming net profitable over its existence, and it shows no signs of taking over the world. Sure, it's big in the US and a few specific markets. But elsewhere it's either banned for undermining labor practices, or has stiff local competition, or it's just not cost competitive and won't enter the market, because without the whole "gig economy" scam it's just a regular taxi company with a better app.
https://www.wheresyoured.at/wheres-the-money/
https://www.wheresyoured.at/openai-is-a-systemic-risk-to-the...
A very solid argument is like that against propaganda: it's not so much about what is being said but about what isn't. OpenAI is basically shouting about every minor achievement from the rooftops, so the fact that they are remarkably silent about financial fundamentals says something. At best something mediocre, more likely bad.
Basically, it's one of those things you may read and find that, all things considered, you don't agree with the conclusions, but there's real substance there and you'll probably benefit from reading a few of his articles.
Source? They did exactly that.
Ape NFTs are… ape NFTs. Useless. Pointless. Negative value for most people.
This is deja vu, except instead of ChatGPT to edit photos it was instagram a decade ago.
Reproducing a certain style of image has been a regular fad since profile pictures became a thing sometime last century.
I was not meaning to suggest that large language & diffusion models are fads.
(I do think their capabilities are poorly understood and/or over-estimated by non-technical and some technical people alike, but that invites a more nuanced discussion.)
While I'm sure your wife is getting good value out of the system, whether it's a better fit for purpose, produces a better quality, or provides a more satisfying workflow -- than say a decent free photo editor -- or whether other tools were tried but determined to be too limited or difficult, etc -- only you or her could say. It does feel like a small sample set, though.
We're talking about Hitler memes instead? I don't understand your feigned outrage.
The actual valid commercial use case for generative images hasn't been found yet. (No, making blog spam prettier is not a good use case.)
I think it's broken out into mainstream adoption and is going to stay there.
It reminds me a little of Napster. The Napster UI was terrible, but it let people do something they had never been able to do before: listen to any piece of music ever released, on-demand. As a result people with almost no interest in technology at all were learning how to use it.
Most people have never had the ability to turn a photo of their kids into a cute cartoon before, and it turns out that's something they really want to be able to do.
Again, I was aware that they added image generation, just not how much of a deal it turned out to be. Think of it like me occasionally noticing merchandise and TV trailers for a new movie without realizing it became the new worldwide box office #1.
This measure of LLM capability could be extended by taking it into the 3D domain.
That is, having the model write Python code for Blender, then running blender in headless mode behind an API.
The talk hints at this, but one-shot prompting likely won't be a broad enough measurement of capability by this time next year. (Or perhaps even now.)
So the test could also include an agentic portion that includes consultation of the latest blender documentation or even use of a search engine for blog entries detailing syntax and technique.
For multimodal input processing, it could take into account a particular photo of a pelican as the test subject.
For usability, the objects can be converted to iOS’s native 3d format that can be viewed in mobile safari.
I built this workflow, including a service for Blender, as an initial test of what was possible in October of 2022. It took post-processing for common syntax errors back then, but I'd imagine the newer LLMs would make those mistakes less often now.
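For the curious, the Blender side of that can be very small. A rough sketch, assuming it's saved as gen_pelican.py and run with `blender --background --python gen_pelican.py` (a real pipeline would emit far more geometry and then convert the export to USDZ for iOS Quick Look):

```
import bpy

# Start from an empty scene so leftover default objects don't pollute the export.
bpy.ops.wm.read_factory_settings(use_empty=True)

# Crude stand-ins for the shapes an LLM might emit: a body and one wheel.
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, location=(0, 0, 2))  # pelican body
bpy.ops.mesh.primitive_torus_add(major_radius=1.0, minor_radius=0.1,
                                 location=(0, 0, 0.5))                # a wheel

# Export to glTF; USDZ conversion for mobile Safari is a separate step.
bpy.ops.export_scene.gltf(filepath="/tmp/pelican.glb")
```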
But bicycles are famously hard for artists as well. Cyclists can identify all of the parts, but if you don't ride a lot it can be surprisingly difficult to get all of the major bits of geometry right.
Here is a better example of a start: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfTfAA...
I did that to my daughter when she was not even 6 years old. The results were somehow similar: https://photos.app.goo.gl/XSLnTEUkmtW2n7cX8
(Now she's much better, but prefers raster tools, e.g. https://www.deviantart.com/sofiac9/art/Ivy-with-riding-gear-...)
The top results (click on the top Solutions) were pretty impressive: https://www.kaggle.com/competitions/drawing-with-llms/leader...
"Say I have a wolf, a goat, and some cabbage, and I want to get them across a river. The wolf will eat the goat if they're left alone, which is bad. The goat will eat some cabbage, and will starve otherwise. How do I get them all across the river in the fewest trips?"
A child would pick up that you have plenty of cabbage, but can't leave the goat without it, lest it starve. Also, there's no mention of boat capacity, so you could just bring them all over at once. Useful? Sometimes. Intelligent? No.
> As a power user of these tools, I want to stay in complete control of what the inputs are. Features like ChatGPT memory are taking that control away from me.
You reap what you sow....
> I already have a tool I built called shot-scraper, a CLI app that lets me take screenshots of web pages and save them as images. I had Claude build me a web page that accepts ?left= and ?right= parameters pointing to image URLs and then embeds them side-by-side on a page. Then I could take screenshots of those two images side-by-side. I generated one of those for every possible match-up of my 34 pelican pictures—560 matches in total.
Surely it would have been easier to use a local tool like ImageMagick? You could even have the AI write a Bash script for you.
> ... but prompt injection is still a thing.
...Why wouldn't it always be? There's no quoting or escaping mechanism that's actually out-of-band.
> There’s this thing I’m calling the lethal trifecta, which is when you have an AI system that has access to private data, and potential exposure to malicious instructions—so other people can trick it into doing things... and there’s a mechanism to exfiltrate stuff.
People in 2025 actually need to be told this. Franklin missed the mark - people today will trip over themselves to give up both their security and their liberty for mere convenience.
And honestly, even with LLM assistance getting Image Magick to output a 1200x600 image with two SVGs next to each other that are correctly resized to fill their half of the image sounds pretty tricky. Probably easier (for Claude) to achieve with HTML and CSS.
FWIW, the next project I want to look at after my current two, is a command-line tool to make this sort of thing easier. Likely featuring some sort of Lisp-like DSL to describe what to do with the input images.
Isn't it Δ∇Λ welded together? The bottom left and right vertices are where the wheels are attached to, the middle bottom point is where the big gear with the pedals is. The lambda is for the front wheel because you wouldn't be able to turn it if it was attached to a delta. Right?
I guess having my first bicycle be a cheap Soviet-era produced one paid off: I spent loads of time fidgeting with the chain tension, and pulling the chain back onto the gears, so I guess I had to stare at the frame way too much to forget even by today the way it looks.
https://www.gianlucagimini.it/portfolio-item/velocipedia/
> back in 2009 I began pestering friends and random strangers. I would walk up to them with a pen and a sheet of paper asking that they immediately draw me a men’s bicycle, by heart. Soon I found out that when confronted with this odd request most people have a very hard time remembering exactly how a bike is made.
I had heard of prompt injection already. But this seems different, completely out of humans' control. Even when you consider web search functionality, he is actually right: more and more, users are losing control over context.
Is this dangerous atm? Do you think it will become more dangerous in the future when we chuck even more data into context?
The issue is that LLMs have no ability to organise their memory by importance. Especially as the context size gets larger.
So when they are using tools they will become more dangerous over time.
> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.
Yeah this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.
(And I'd be envious of your impact, of course)
"The word "strawberry" contains 2 letter r’s."
strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said three
strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said four
stawberrry -> DeepSeek, GeminiPro all correctly said three
ChatGPT4o, even in a new chat, incorrectly said the word "stawberrry" contains 4 letter "r" characters. It even provided this useful breakdown to let me know :-)
Breakdown: stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's
And then asked if I meant "strawberry" instead and said because that one has 2 r's....
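The counts themselves are trivially checkable outside the model:

```
for word in ("strawberry", "strawberrry", "stawberrry"):
    print(word, word.count("r"))
# strawberry 3
# strawberrry 4
# stawberrry 3
```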
x8 version: still shit... x15 version: we are getting closer, but overall a shit experience :D
this way they won't know what to improve upon. of course they can buy access. ;P
when they finally solve your problem you can reveal what was the benchmark.
```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0,      duration: 'triplet' }, // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0,      duration: 'triplet' }, // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
]
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.
I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).
https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m
I know there are far better ways to do gen AI with music, this was just a joke prompt that worked far better than I expected.
My naive guess is all of the guitar tabs and signal processing info it's trained on gives it the ability to do stuff like this (albeit not very well).
Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.
Side note: I'd really like to see the Language Benchmark Game become a prompt based languages * models benchmark game. So we could say model X excels at Python Fasta, etc. although then the risk is that, again, it becomes training set and the whole thing self-rigs itself.
I also can't help but notice that the competition is exactly one match short; for some reason exactly one of the 561 possible pairings has not been included.
The missing match is because one single round was declared a draw by the model, and I didn't have time to run it again (the Elo stuff was very much rushed at the last minute.)
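(For anyone wanting to reproduce it, the Elo bookkeeping per match is only a few lines; a minimal sketch, with K=32 as a conventional choice rather than necessarily what was used here:)

```
def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a draw.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```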
OP seems to ignore that a pelican has a distinct look when evaluating these doodles.
You didn't even mention the beak or the lack of similarities in your blog.
Your text is centered around this rather peculiar statement:
> Most importantly: pelicans can’t ride bicycles.
"Pelicans can't ride bicycles" is a good joke.
GIGO in motion :-)
No... that is an attempt at actually drawing the pedals, and putting the pelican's feet right on the pedals!
neepi•8mo ago
dist-epoch•8mo ago
neepi•8mo ago
I would not hire a blind artist or a deaf musician.
dist-epoch•8mo ago
Like asking you to draw a 2D projection of a 4D sphere intersected with a 4D torus or something.
kevindamm•8mo ago
namibj•8mo ago
Most of the non-math design work of applied engineering AFAIK falls under the umbrella that's tested with the pelican riding the bicycle. You have to make a mental model and then turn it into applicable instructions.
Program code/SVG markup/parametric CAD instructions don't really differ in that aspect.
neepi•8mo ago
Ergo no, you can't just throw a bicycle into an LLM and have a parametric model drop out into SolidWorks, then a machine makes it and everyone buys it. That is the hope, really, isn't it? You end up with a useless shitty bike with a shit pelican on it.
The biggest problem we have in the LLM space is that no one really understands any of the proposed use cases well enough, and neither does anyone being told that it works for those use cases.
dist-epoch•8mo ago
neepi•8mo ago
rjsw•8mo ago
neepi•8mo ago
__alexs•8mo ago
dmd•8mo ago
You too, Monet. Scram.
simonw•8mo ago
It's a fun way to deflate the hype. Sure, your new LLM may have cost XX million to train and beat all the others on the benchmarks, but when you ask it to draw a pelican on a bicycle it still outputs total junk.
dist-epoch•8mo ago
https://chatgpt.com/share/684582a0-03cc-8006-b5b5-de51e5cd89...
lol: https://gemini.google.com/share/4d1746a234a8
wongogue•8mo ago
matkoniecz•8mo ago
neepi•8mo ago
ben_w•8mo ago
My CV had a stupid cliché, "committed to quality", which they correctly picked up on — "What do you mean?" one of them asked me, directly.
I thought this meant I was focussed on being the best. He didn't like this answer.
His example, blurred by 20 years of my imperfect human memory, was to ask me which is better: a Porsche, or a go-kart. Now, obviously (or I wouldn't be saying this), Porsche was a trick answer. Less obviously, both were trick answers, because their point was that the question was under-specified — quality is the match between the product and what the user actually wants. So if the user is a 10 year old who physically isn't big enough to sit in a real car's driver's seat and just wants to rush down a hill or along a track, none of the "quality" stuff that makes a Porsche a Porsche is of any relevance at all, but what does matter is the stuff that makes a go-kart into a go-kart… one of which is the affordability.
LLMs are go-karts of the mind. Sometimes that's all you need.
neepi•8mo ago
Go-kart or Porsche is irrelevant.
ben_w•8mo ago
That's the point.
The market for go-karts does not support Porsche.
If you bring a Porsche sales team to a go-kart race, nobody will be interested.
Porsche doesn't care about this market. It goes both ways: this market doesn't care about Porsche, either.
keiferski•8mo ago
Prompting for a pelican riding a bicycle makes a decent image there.
keiferski•8mo ago
GaggiX•8mo ago
jug•8mo ago
Result: https://www.dropbox.com/scl/fi/8b03yu5v58w0o5he1zayh/pelican...
These are tough benchmarks to trial reasoning by having it _write_ an SVG file by hand and understanding how it's to be written to achieve this. Even a professional would struggle with that! It's _not_ a benchmark to give an AI the best tools to actually do this.
YuccaGloriosa•8mo ago
sethaurus•8mo ago
spaceman_2020•8mo ago
vunderba•8mo ago
A similar test would be if you asked for the pelican on a bicycle through a series of LOGO instructions.