
The last six months in LLMs, illustrated by pelicans on bicycles

https://simonwillison.net/2025/Jun/6/six-months-in-llms/
81•swyx•4h ago

Comments

neepi•1h ago
My only takeaway is that they're all terrible and I should hire a professional.
dist-epoch•1h ago
Most of them are text-only models. Like asking a person born blind to draw a pelican based on what they've heard it looks like.
neepi•49m ago
That seems to be a completely inappropriate use case?

I would not hire a blind artist or a deaf musician.

dist-epoch•44m ago
The point is about exploring the capabilities of the model.

Like asking you to draw a 2D projection of a 4D sphere intersected with a 4D torus or something.

namibj•43m ago
It's a proxy for abstract design work, like writing software or designing in a parametric CAD tool.

Most of the non-math design work of applied engineering, AFAIK, falls under the umbrella that's tested by the pelican riding the bicycle. You have to build a mental model and then turn it into applicable instructions.

Program code/SVG markup/parametric CAD instructions don't really differ in that aspect.
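
For concreteness, the whole test is only a few lines - a rough sketch using Simon's llm Python library (get_model()/prompt()); the model id is illustrative:

    import llm

    # Ask a text-only model for SVG markup. It never "sees" a pelican;
    # it has to design one from a purely verbal description.
    model = llm.get_model("gpt-4.1-mini")  # illustrative model id
    response = model.prompt("Generate an SVG of a pelican riding a bicycle")
    with open("pelican.svg", "w") as f:
        f.write(response.text())  # open in a browser to judge the result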

neepi•23m ago
As a former actual, real, tangible meatspace engineer, I would not assume that this methodology applies to applied engineering. Things are a little nuanced, and the nuances come from a combination of communication and experience, neither of which any LLM has any insight into at all. It's not out there on the internet to train on, and it's not even easy to put into abstract terms that could be used as training data. And engineering doesn't exist in isolation - there is a whole world around it.

Ergo, no, you can't just throw a bicycle at an LLM and have a parametric model drop out into SolidWorks, then a machine makes it and everyone buys it. That is the hope, really, isn't it? You end up with a useless shitty bike with a shit pelican on it.

The biggest problem we have in the LLM space is that no one really understands the proposed use cases well enough - not the people proposing them, and not the people being told that it works for them.

dist-epoch•16m ago
https://www.solidworks.com/lp/evolve-your-design-workflows-a...
__alexs•40m ago
I guess the idea is that by asking the model to do something that is inherently hard for it, we might learn something about the baseline smartness of each model, which could be considered a predictor of performance at other tasks too.
dmd•37m ago
Sorry, Beethoven, you just don’t seem to be a match for our org. Best of luck on your search!

You too, Monet. Scram.

matkoniecz•1h ago
It depends on the quality you need and your budget.
neepi•50m ago
Ah yes, the race-to-the-bottom argument.
ben_w•10m ago
When I was at university, they got some people from industry to talk to us all about our CVs and how to do interviews.

My CV had a stupid cliché, "committed to quality", which they correctly picked up on — "What do you mean?" one of them asked me, directly.

I thought this meant I was focussed on being the best. He didn't like this answer.

His example, blurred by 20 years of my imperfect human memory, was to ask me which is better: a Porsche, or a go-kart. Now, obviously (or I wouldn't be saying this), Porsche was a trick answer. Less obvious is that both were trick answers, because their point was that the question was under-specified. Quality is the match between the product and what the user actually wants. So if the user is a 10 year old who physically isn't big enough to sit in a real car's driver's seat and just wants to rush down a hill or along a track, none of the "quality" stuff that makes a Porsche a Porsche is of any relevance at all; what does matter is the stuff that makes a go-kart into a go-kart… one of which is affordability.

LLMs are go-karts of the mind. Sometimes that's all you need.

keiferski•40m ago
As the other guy said, these are text models. If you want to make images, use something like Midjourney.

Prompting for a pelican riding a bicycle makes a decent image there.

joshstrange•1h ago
I really enjoy Simon's work in this space. I've read almost every blog post he's written on this, and I love seeing him poke and prod the models to see what pops out. The CLI tools are all very easy to use and complement each other nicely, all without trying to do too much by themselves.

And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.

Thank you Simon!

nathan_phoenix•1h ago
My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.

You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...

Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
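
Concretely, something like this - a rough sketch, again assuming Simon's llm Python library, with illustrative model ids:

    import llm

    PROMPT = "Generate an SVG of a pelican riding a bicycle"
    N = 10  # samples per model instead of a single draw

    for model_id in ["gpt-4.1-mini", "claude-3.5-haiku"]:  # illustrative ids
        model = llm.get_model(model_id)
        for i in range(N):
            svg = model.prompt(PROMPT).text()
            with open(f"{model_id}-{i}.svg", "w") as f:
                f.write(svg)
        # Rate the N renders per model (by eye, or with a judge model)
        # and compare averages rather than single draws.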

puttycat•35m ago
You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work deterministically, like humans.

In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge, you'd expect the output to be perfect, because a correct output would lower the model's loss. These outputs clearly indicate flawed knowledge.

ben_w•30m ago
> In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/

planb•33m ago
And by a sample that has become increasingly well known as a benchmark. Newer training data will contain more articles like this one, which naturally improves an LLM's ability to estimate what's considered a good "pelican on a bike".
anon373839•56m ago
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a really strong release, especially the fine-grained MoE which is unlike anything that’s come before (in terms of capability and speed on consumer hardware).
Maxious•15m ago
Cut for time - Qwen 3 was pelican-tested too: https://simonwillison.net/2025/Apr/29/qwen-3/
qwertytyyuu•36m ago
https://imgur.com/a/mzZ77xI Here are a few I tried with the models. Looks like the newer version of Gemini is another improvement?
puttycat•33m ago
The bicycles are still very far from actual ones.
JimDabell•33m ago
See also: The recent history of AI in 32 otters

https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...

bravesoul2•32m ago
Out of interest, is there a good model (any architecture) for vector graphics?
dirtyhippiefree•21m ago
Here’s the spot where we see who’s TL;DR…

> Claude 4 will rat you out to the feds!

> If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it'll rat you out.

ben_w•6m ago
I'd say that's too short.

> But it’s not just Claude. Theo Browne put together a new benchmark called SnitchBench, inspired by the Claude 4 System Card.

> It turns out nearly all of the models do the same thing.

Whither Help Scout?

https://bitsplitting.org/2025/04/30/whither-help-scout/
1•tosh•1m ago•0 comments

Linux or Landfill? End of Windows 10 Leaves PC Charities with Tough Choice

https://www.tomshardware.com/software/operating-systems/linux-or-landfill-end-of-windows-10-leaves-pc-charities-with-tough-choice
1•Filligree•1m ago•0 comments

Swift 6 Productivity in the Sudden Age of LLM-Assisted Programming

https://daringfireball.net/linked/2025/06/07/swift-6-llms
1•tosh•6m ago•0 comments

Life in 2045

https://www.instagram.com/reel/DKkrY-Th2zN/
1•Kaibeezy•8m ago•0 comments

Dancing brainwaves: How sound reshapes your brain networks in real time

https://www.sciencedaily.com/releases/2025/06/250602155001.htm
1•lentoutcry•10m ago•0 comments

Nanobrowser: Open-Source Chrome extension for AI-powered web automation

https://github.com/nanobrowser/nanobrowser
1•simonpure•14m ago•0 comments

DejaGNU (2011)

https://www.airs.com/blog/archives/499
1•fanf2•15m ago•0 comments

Google Wins Copyright Claim Dismissal in Publishers' Textbook Piracy Lawsuit

https://torrentfreak.com/google-wins-copyright-claim-dismissal-in-publishers-textbook-piracy-lawsuit-250608/
2•Improvement•19m ago•1 comments

Top Vibe Coding Tools to Boost Your Productivity

https://medium.com/developersglobal/top-vibe-coding-tools-to-boost-your-productivity-c5644d2548f8
2•dhanushnehru•20m ago•0 comments

Show HN: AISheets: PDF-to-interactive worksheets (with LaTeX support)

https://www.aisheets.study/
1•pk97•22m ago•0 comments

The Secret Engine That Makes Go 10x Faster Than You Think

https://dhanushnehru.medium.com/the-secret-engine-that-makes-go-10x-faster-than-you-think-5d3317334a27
2•dhanushnehru•22m ago•0 comments

Show HN: A Minimal Productivity Dashboard on a Raspberry Pi

https://github.com/108charlotte/magic-mirror
1•108charlotte•23m ago•0 comments

Show HN: Merge Images

https://mergemyimages.com/
1•artiomyak•28m ago•0 comments

Show HN: I built a real-time dashboard to visualize my hourly earnings

https://www.bobo.wtf/
2•barisbll•37m ago•0 comments

No JS, No BS Ethical Web Analytics

http://trop.in/blog/no-js-no-bs-ethical-web-analytics
1•true_pk•42m ago•0 comments

Junited 2025

https://birming.com/2025/06/01/junited/
1•DamonHD•46m ago•1 comments

White House security staff warned Musk's Starlink is a security risk

https://www.washingtonpost.com/technology/2025/06/07/starlink-white-house-security-doge-musk/
5•doener•46m ago•2 comments

The Number of Satellites Launched into Space

https://twitter.com/MAstronomers/status/1931417310532645130
1•keepamovin•54m ago•1 comments

Stefan Zweig Followed His Europe into Suicide (2017)

https://www.theamericanconservative.com/stefan-zweig-followed-his-europe-into-suicide/
1•Michelangelo11•54m ago•0 comments

Ask HN: How to learn CUDA to professional level

27•upmind•1h ago•10 comments

I got a remote job for a EU company, I'd find it hard to go back to a US-based

https://www.businessinsider.com/remote-work-european-company-us-work-life-balance-2025-6
30•nixass•1h ago•29 comments

Apple Is on Defense at WWDC

https://www.theverge.com/apple/681739/wwdc-2025-epic-trial-apple-intelligence
13•pseudolus•1h ago•0 comments

Pen.el (A Holy OS)

https://github.com/semiosis/pen.el
3•qifzer•1h ago•0 comments

Keeping Ahead of Contagion

https://press.asimov.com/articles/contagion
1•mu0n•1h ago•0 comments

What I've Learned from 15 Years of Doing OKRs

https://eleganthack.com/what-ive-learned-from-15-years-of-doing-okrs/
1•adrianhoward•1h ago•0 comments

Getting nothing done: the art of stress-free non-productivity

https://bitfieldconsulting.com/posts/getting-nothing-done
1•mu0n•1h ago•0 comments

Putin's WeChat Wager

https://meduza.io/en/feature/2025/06/07/putin-s-wechat-wager
2•N19PEDL2•1h ago•0 comments

Ted Cruz bill: States that regulate AI will be cut out of $42B broadband fund

https://arstechnica.com/tech-policy/2025/06/ted-cruz-bill-states-that-regulate-ai-will-be-cut-out-of-42b-broadband-fund/
2•ndsipa_pomu•1h ago•2 comments

A Japanese lander crashed on the Moon after losing track of its location

https://arstechnica.com/space/2025/06/a-japanese-lander-crashed-on-the-moon-after-losing-track-of-its-location/
3•pseudolus•1h ago•0 comments

Astronomers thought Milky Way doomed to crash into Andromeda. Now not so sure

https://phys.org/news/2025-06-astronomers-thought-milky-doomed-andromeda.html
2•pseudolus•1h ago•0 comments