The right way to handle this is not to hand it grids and whatnot, which all get blown away by the embedding encoding, but to instruct it to build image-processing tools of its own and to mandate their use for constructing the required coordinates, computing the eccentricity of the pattern, etc., in code and language space. Done this way, you can even get it to write assertive tests comparing the original layout to the final one across various image-processing metrics. This would assuredly work better, take far less time, be more stable on iteration, and fits neatly into how a multimodal agentic programming tool actually functions.
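Something like the following is what I have in mind for the measurement side: a minimal sketch assuming OpenCV and NumPy, with placeholder filenames and thresholds you'd have the model tune.

```python
import cv2
import numpy as np

def planet_centers(path, min_area=50):
    """Centroids of the large bright blobs (planets) on the dark starfield."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, mask = cv2.threshold(img, 60, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    pts = []
    for c in contours:
        if cv2.contourArea(c) < min_area:  # skip individual stars and noise
            continue
        m = cv2.moments(c)
        pts.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return np.array(pts, dtype=np.float32)

def eccentricity(centers):
    """Fit an ellipse to the planet centers and return its eccentricity."""
    _, (a, b), _ = cv2.fitEllipse(centers)  # needs at least 5 points
    return float(np.sqrt(1.0 - (min(a, b) / max(a, b)) ** 2))

# The assertive test: original layout vs the model's rebuild.
orig = eccentricity(planet_centers("original.png"))
rebuilt = eccentricity(planet_centers("rebuilt.png"))
assert abs(orig - rebuilt) < 0.05, f"layout drifted: {orig:.3f} vs {rebuilt:.3f}"
```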
That's not how to successfully use LLMs for coding, in my experience. It is, however, perhaps a good demonstration of Claude's poor spatial reasoning skills. Another good demonstration is twitch.tv/ClaudePlaysPokemon, where Claude has been failing to beat Pokémon for months now.
> That's not how to successfully use LLMs for coding, in my experience.
Yeah agree. I think I was just a little surprised it couldn't one-shot given the simplicity.
https://github.com/anthropics/claude-code/blob/main/plugins/...
This article is a bit negative. Claude gets close; it just can't get the order right, which is something OP can manually fix.
I prefer GitHub Copilot because it's cheaper and integrates with GitHub directly. I'll have times where it'll get it right, and times when I have to try 3 or 4 times.
This is also fairly contrived, you know? It's not a realistic limitation to rebuild HTML from a screenshot because of course if I have the website loaded I can just download the HTML.
???
This is precisely the workflow when a traditional graphic designer mocks up a web/app design, which still happens all the time.
They sketch a design in something like Photoshop or Illustrator, because they're fluent in these tools and many have been using them for decades, and somebody else is tasked with figuring out how to slice and encode that design in the target interactive tech (HTML+CSS, SwiftUI, Qt, etc.).
Large companies, design agencies, and consultancies with tech-first design teams have a different workflow, because they intentionally staff graphic designers with a tighter specialization/preparedness, but that's a much smaller share of the web and software development space than you may think.
There's nothing contrived at all about this test and it's a really great demonstration of how tools like Claude don't take naturally to this important task yet.
I've tried these tools a number of times and spent a good bit of effort on learning to maximize the return. By the time you know what prompt to write you've solved the problem yourself.
What if the LLM gets something wrong that the operator (a junior dev, perhaps) doesn't even know is wrong? That's the main issue: if it fails here, it will fail with other things, in less obvious ways.
As the post shows, you can't trust them when they think they solved something, but you also can't trust them when they think they haven't[0]. These things are optimized for human preference, which ultimately results in their being optimized to hide mistakes. After all, we can't penalize mistakes in training when we don't know the mistakes are mistakes. The de facto bias is that we prefer mistakes we don't know are mistakes over mistakes that we do[1].
Personally I think a well designed tool makes errors obvious. As a tool user, that's what I want, and that's what makes tool use effective. But LLMs flip this on its head, making errors difficult to detect. Which is incredibly problematic.
[0] I frequently see this in a thing it thinks is a problem but actually isn't, which makes steering more difficult.
[1] Yes, conceptually unknown unknowns are worse. But you can't measure unknown unknowns; they are indistinguishable from knowns. So you always optimize for deception (along with other things) when you don't have clear objective truths (which is most situations).
If the tool needs you to check up on it and fix its work, it's a bad tool.
It’s not that binary imo. It can still be extremely useful and save a ton of time if it does 90% of the work and you fix the last 10%. Hardly a bad tool.
It’s only a bad tool if you spent more time fixing the results than building it yourself, which sometimes used to be the case for LLMs but is happening less and less as they get more capable.
I agree that there are domains for which 90% good is very, very useful. But 99% isn't always better. In some limited domains, it's actually worse.
Humans don't get it right 100% of the time.
All tools have failure modes and truthfully you always have to check the tool's work (which is your work). But being a master craftsman is knowing all the nuances behind your tools, where they work, and more importantly where they don't work.
That said, I think that also highlights the issue with LLMs and most AI. Their failure modes are inconsistent and difficult to verify. Even with agents and unit tests you still have to verify, and it isn't easy. Most software bugs come from subtle things, which often compound. And those two things are the greatest weaknesses of LLMs: nuance and compounding effects.
So I still think they aren't great tools, but I do think they can be useful. But that also doesn't mean it isn't common for people to use them well outside the bounds of where they are generally useful. It'll be fine a lot of times, but the problem is that it is like an alcohol fire[0]; you don't know what's on fire because it is invisible. Which, after all, isn't that the hardest part of programming? Figuring out where the fire is?
It's remarkable how high our expectations have been steadily creeping.
What's with the panicked pleas and need to preserve the site, assuming locally...?
It seems to me the post is about how Claude fails to recreate a very simple website from 1996.
- "First, calculate the orbital radius. To do this accurately, measure the average diameter of each planet, p, and the average distance from the center of the image to the outer edge of the planets, x, and calculate the orbital radius r = x - p"
- "Next, write a unit test script that we will run that reads the rendered page and confirms that each planet is on the orbital radius. If a planet is not, output the difference you must shift it by to make the test pass. Use this feedback until all planets are perfectly aligned."
That said, I love this project. haha
The loop here, imo, refers to the feedback loop. And it's true that ideally there should be no human involvement there. A tight feedback loop is as important for LLMs as it is for humans. The more automated you make it, the better.
One of the keys to being productive with LLMs is learning how to recognize when it's going to take much more effort to babysit the LLM into getting the right result as opposed to simply doing the work yourself.
Wrt the unit test script, let's take Claude out of the equation: how would you design the unit test? I kept running into either Claude or some library not being capable of consistently identifying planet vs. non-planet, which was hindering Claude's ability to make decisions based on fine detail or "pixel coordinates", if that makes sense.
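For what it's worth, one way to make planet-vs-non-planet deterministic might be circle detection constrained to the known planet sizes, so stars fall out on radius alone. A sketch assuming OpenCV; every parameter here is a guess to tune:

```python
import cv2
import numpy as np

img = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
img = cv2.medianBlur(img, 5)  # suppress starfield speckle

# minRadius/maxRadius bracket the known planet sizes, excluding stars.
circles = cv2.HoughCircles(
    img, cv2.HOUGH_GRADIENT, dp=1, minDist=40,
    param1=100, param2=30, minRadius=15, maxRadius=60,
)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        print(f"planet candidate at ({x}, {y}), radius {r}px")
```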
> What he produces
I feel like personifying LLMs more than they currently are is a mistake people make (though humans always do this), they're not entities, they don't know anything. If you treat them too human you might eventually fool yourself a little too much.
I don't doubt that it is possible eventually, but I haven't had much luck.
Something that seemed to assist was drawing a multi coloured transparent chequerboard, if the AI knows the position of the grid colours it can pick out some relative information from the grid.
I have also not had luck with any kind of iterative/guess-and-check approach. I assume the models are all trained to one-shot this kind of thing and struggle to generalize to what are effectively relative measurements.
> After these zoom attempts, I didn't have any new moves left. I was being evicted. The bank repo'd my car. So I wrapped it there.
https://knowyourmeme.com/memes/my-father-in-law-is-a-builder...
You say that as if that’s uncommon.
I am lucky that I don't depend on this for work at a corporation. I'd be pulling my hair out if some boss said "You are going to be doing 8 times as much work using our corporate AI from now on."
Why not use wget to mirror the website? Unless you're being sarcastic.
$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.org
Source: https://superuser.com/questions/970323/using-wget-to-copy-we...
“Reference the original uploaded image. Between each image in the clock face, create lines to each other image. Measure each line. Now follow that same process on the app we’ve created, and adjust the locations of each image until all measurements align exactly.”
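The measurement half of that prompt reduces to pairwise distances, something like this sketch (assuming the per-image positions have already been extracted; the names and tolerance are made up):

```python
import itertools
import math

def pairwise_distances(positions):
    """Distance between every pair of images, keyed by name pair."""
    return {
        (a, b): math.dist(positions[a], positions[b])
        for a, b in itertools.combinations(sorted(positions), 2)
    }

def misalignments(reference, rendered, tolerance=2.0):
    """Pairs whose spacing in the rendered app differs from the mock."""
    ref, got = pairwise_distances(reference), pairwise_distances(rendered)
    return {
        pair: got[pair] - ref[pair]
        for pair in ref
        if abs(got[pair] - ref[pair]) > tolerance
    }
```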
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
And
> I'm an engineering manager
I can't tell if this is an intentional or unintentional satire of the current state of AI mandates from management.
- Read an .icc file from disk
- Parsed the file and extracted the VCGT (video card gamma table)
- Wrote the VCGT to the video card for a specified display via amdgpu driver APIs
The only thing I had to fix was the ICC parsing, where it would parse header strings in the wrong byte order (they are big-endian).
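(For context, ICC profiles are big-endian throughout, so a fix like that presumably comes down to forcing big-endian unpacking. A hypothetical illustration, not the actual code from this comment:)

```python
import struct

with open("display.icc", "rb") as f:
    header = f.read(128)  # the ICC header is a fixed 128 bytes

# ">" forces big-endian; a little-endian "<I" here would scramble every field.
profile_size, cmm_type = struct.unpack(">I4s", header[:8])
print(profile_size, cmm_type)
```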
Because the rendered output (pixels, not HTML/CSS) is not fed as data in the training. You will find tons of UI snippets and questions, but they rarely include screenshots. And when they do, the screenshots are not scraped.
I don’t think it’s fair to call someone who used Stack Overflow to find a similar answer with samples of code to copy to their project an asshole.
Stack Overflow offers access to other people's work, and developers combine those snippets and patterns into their own projects. I suspect attribution is low.
Or are you saying that every piece of code you ever wrote was 100% original and not adapted from any previous codebase you ever worked in or any book / reference you ever read?
Do you have papers to back this up? That was also my reaction when I saw some really crazy accurate comments on some vibe-coded piece of code, but I couldn't prove it, and thinking about it now I think my intuition was wrong (i.e., LLMs do produce original complex code).
If that does not work, then the moment you introduce AI you cap its capabilities unless humans continue to create original works to feed the AI. The conclusion - to me, at least - is that these pieces of software regurgitate their inputs, they are effectively whitewashing plagiarism, or, alternatively, their ability to generate new content is capped by some arbitrary limit relative to the inputs.
We all stand on the shoulders of giants and learn by looking at others’ solutions.
To me that's proof positive they know their output is mangled inputs, they need that originality otherwise they will sooner or later drown in nonsense and noise. It's essentially a very complex game of Chinese whispers.
(I created a template language for JSON, added branching and conditionals, and realized I had a whole programming language. I was really proud of my originality until I was reading Ted Nelson's Computer Lib/Dream Machines and found out I had reinvented TRAC and, to some extent, XSLT. Anyway, LLMs are very good at reasoning about it because it can be constrained by a JSON schema. People who think LLMs only regurgitate haven't given them a fair shot.)
Would we use the same word if two different humans wrote code that solved two different problems, but one part of each problem was somewhat analogous to a different aspect of a third human's problem, and the third human took inspiration from those parts of both solutions to create code that solved a third problem?
What if it were ten different humans writing ten different-but-related pieces of code, and an eleventh human piecing them together? What if it were 1,000 different humans?
I think "plagiarism", "inspiration", and just "learning from" fall on some continuous spectrum. There are clear differences when you zoom out, but they are in degree, and it's hard to set a hard boundary. The key is just to make sure we have laws and norms that provide sufficient incentive for new ideas to continue to be created.
Programmers are not expected to add an addendum to every file listing all the books, articles, and conversations they've had that have influenced the particular code solution. LLMs are trained on far more sources that influence their code suggestions, but it seems like we actually want a higher standard of attribution because they (arguably) are incapable of original thought.
What do you mean? The programmer's work is literally combining existing patterns into solutions for problems.
Humans do this all the time.
Who?
I'm not mad at the dogs, I'm mad at the stupid investors and the lying owners.
Now that's a novel sentence if I've ever read one.
Calling Claude (or GPT-5 or Gemini or my bash terminal for that matter) a "he" seems absurd to the point of hilarity.
In my mind, they've always firmly been "it"s.
In other languages, chairs have a gender, along with other everyday items like scissors, and it doesn't especially make logical sense, although you can squint and tell a story as to why something is the gender that's been assigned. Thus the gender of AI becomes simply a matter of "that's just how things are".
Give Claude a way to iteratively poke at what it created (such as a Playwright harness), a screenshot of what you want, and maybe a way to take a screenshot in Playwright, and I think you will get much closer. You might even be able to one-shot it.
I’ve always wondered what would happen if I gave it a screenshot and told it to iterate until the Playwright screenshot matched the mock screenshot, pixel perfect. I imagine it would go nuts, but after a few hours I think it would likely get it. (Either that or minor font discrepancies and rounding errors would cause it to give up…)
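The harness side of that loop is small enough to sketch. Assuming Playwright for Python and Pillow; the single-number score is deliberately crude:

```python
from PIL import Image, ImageChops
from playwright.sync_api import sync_playwright

def screenshot_diff(url, mock_path, out_path="attempt.png"):
    """Render the page, screenshot it, and score its distance from the mock."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 800, "height": 600})
        page.goto(url)
        page.screenshot(path=out_path)
        browser.close()
    mock = Image.open(mock_path).convert("RGB")
    got = Image.open(out_path).convert("RGB").resize(mock.size)
    hist = ImageChops.difference(mock, got).convert("L").histogram()
    # Mean per-pixel difference on a 0-255 scale; 0.0 is pixel-perfect.
    return sum(i * n for i, n in enumerate(hist)) / (mock.size[0] * mock.size[1])

print(screenshot_diff("http://localhost:8000", "mock.png"))
```

The model loops on edits until the score stops improving (or, as you say, gives up over font discrepancies and rounding).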
Absolute positioning wasn't available until CSS2 in 1998. This is just a table with crafty use of align, valign, colspan, and rowspan.
Like the web was meant to be. An interpreted hypertext format, not a pixel-perfect brochure for marketing execs.
what?
(Fun fact, the most amazing layout foot-guns, then: effective font sizes and line heights are subject to platform and configuration (e.g., Win vs Mac); Netscape does paragraph spacing at 1.2em, IE at 1em (if this matters, prefer `<br>` over paragraphs); frame dimensions in Netscape are always calculated as integer percentages of window dimensions, even if you provide absolute dimensions in pixels, while IE does what it says on the tin (a rare example), so the two will match only by chance and rounding luck. And, of course, screen gamma is different on Win and Mac, so your colors will always be messed up – aim for a happy medium.)
Anthropic's leadership and researchers continue to this day to post messages saying engineering will be fully automated. I can go find recent messages on X if you'd like.
This forum is composed mostly of engineers, who will be the most impacted if that vision of the world pans out.
YC depends on innovation capital to make money. If the means of production are centralized, how does YC make any money at all from engineers? Such a world will be vertically and horizontally integrated, not democratically spread for others to take advantage of.
Now I don't think that's what's going to happen, but that's what the messaging has been and continues to be from Anthropic's leadership, researchers, and ICs.
Why should we support companies like this?
Shouldn't we advocate for open models, where any market participant can fully utilize and explore the competitive gradients?
I don't think I'm saying anything controversial here.
Furthermore, if this pans out like it seems it will - a set of three or four AI hyperscalers - we'll also be in the same situation we have today with the big tech hyperscalers.
Due to a lax regulatory environment, these companies put a ceiling on startup exits by funding internal competition, buying competitors, etc. I don't see how the situation will improve in an AI world.
If you're a capitalist, you want competition to be fierce and fair. You don't want concentration of power.
I can see how an Anthropic IC might not like this post, but this should be fairly reasonable for everyone else who would like to see more distribution of power.
Sure, you can argue this is some sort of modern Acid Test - but the Acid tests checked for real-world use cases. This feels more like "I have this one very specific request, the machine doesn't perfectly fulfill it, so the machine is at fault." Complaining from a high pedestal.
I'm more surprised at how close Claude got in its reimagined Space Jam site.
Just out of curiosity, how big was what you considered Claude's best attempt to be?
Why not downscale the image and feed it as a second input, so that entire planets fit into a patch, and instruct it to use the downsampled image for coarse coordinate estimation?
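E.g., something like this (a sketch; the 4x factor is arbitrary):

```python
import cv2

img = cv2.imread("screenshot.png")
h, w = img.shape[:2]
# Shrink 4x so each planet spans only a few vision-encoder patches.
coarse = cv2.resize(img, (w // 4, h // 4), interpolation=cv2.INTER_AREA)
cv2.imwrite("screenshot_coarse.png", coarse)
```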
I'm not well at the moment, perhaps your ML model has flu?!!
They all fail massively 100% of the time. Even if I break it down into chunks once they get to the chunks that matter the most (i.e. physics, collision detection and resolution, event handling and game logic) they all break down horribly and no amount of prompting back and forth will fix it.
Help, I can't recreate a website with AI! There's no other way, no way I could fix up some HTML code! Believe me, I'm an engineering manager with a computer science degree!
Absolutely disgusting.
Can confirm: Claude is weirdly good at generating functional game logic from vague prompts, but spatial precision is a constant battle. Anything involving exact pixel positions needs validation/correction layers on top.
The suggestion upthread about having it write its own measurement tools seems promising - haven't tried that approach yet.
Interesting that it handles this fine (functional orbital mechanics, animation) but would probably struggle to recreate the exact pixel positions of the Space Jam layout. Confirms the pattern: good at "make something like X" but bad at "recreate X exactly."
They are vastly less intelligent than a human, and logical leaps that make sense to you make no sense to Claude. It has no concept of aesthetics or, of course, any vision.
All that said; it got pretty close even with those impediments! (It got worse because the writer tried to force it to act more like a human would)
I think a better approach would be to write a tool to compare screenshots, identify misplaced items, and output that as a text finding/failure state. Claude will work much better because you're dodging the bits that are too interpretive (the part humans rock at and LLMs don't).
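Such a tool can be tiny. A sketch assuming OpenCV and two same-sized screenshots; the thresholds are placeholders:

```python
import cv2

want = cv2.imread("mock.png")
got = cv2.imread("attempt.png")

# Regions where the attempt differs from the mock, reported as text.
gray = cv2.cvtColor(cv2.absdiff(want, got), cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 30, 255, cv2.THRESH_BINARY)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 100:  # ignore antialiasing specks
        print(f"FAIL: region ({x}, {y}) size {w}x{h} differs from the mock")
```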
I would phrase it more like this: they are a completely alien “intelligence” that can't really be compared to human intelligence.
https://github.com/steipete/agent-scripts/blob/main/scripts/...
We can’t. We don’t know how to do it.
Edit: yup, very easy for Claude. Most of the wall time was spent installing cv2 in a venv (uv dropped Python 3.14 on Claude, which seemed to get stuck compiling numpy or something, so I interrupted and told Claude to just use Python 3.13 instead).
ME: In this directory there are two images: @p-pressbox.gif and @screenshot.png. @p-pressbox.gif is an image that exists somewhere inside @screenshot.png. Write a script that finds the precise location.
CLAUDE: Looking at the images, I can see that p-pressbox.gif (the spaceship with "PRESS BOX SHUTTLE" text) appears in the upper-left area of screenshot.png. I'll write a Python script using OpenCV to find the precise location.
[ some venv stuff/telling claude to use uv and python 3.13 ]
CLAUDE: The script found the precise location of p-pressbox.gif within screenshot.png:
| Measurement | Value |
|---------------------|-----------------|
| Top-left corner | (87, 110) |
| Bottom-right corner | (218, 166) |
| Center | (152, 138) |
| Template size | 131 × 56 pixels |
| Match confidence | 97.2% |
The script is at /tmp/search/find_image.py and uses OpenCV's template matching with normalized cross-correlation.
ME: As part of the output, create a visualization of @p-pressbox.gif positioned on top of the original screenshot.
CLAUDE: Done! The visualization shows a green rectangle highlighting the exact location where p-pressbox.gif was found within the screenshot. The match is at coordinates (87, 110) to (218, 166).
I confirmed the result was correct.
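For reference, the core of such a script is only a few lines. This is a reconstruction, not the actual /tmp/search/find_image.py; one wrinkle is that OpenCV can't read GIFs, so the template goes through Pillow:

```python
import cv2
import numpy as np
from PIL import Image

template = cv2.cvtColor(
    np.array(Image.open("p-pressbox.gif").convert("RGB")), cv2.COLOR_RGB2BGR
)
screenshot = cv2.imread("screenshot.png")

# Normalized cross-correlation; the best match is the global maximum.
result = cv2.matchTemplate(screenshot, template, cv2.TM_CCORR_NORMED)
_, confidence, _, (x, y) = cv2.minMaxLoc(result)
h, w = template.shape[:2]
print(f"top-left ({x}, {y}), bottom-right ({x + w}, {y + h}), {confidence:.1%}")

# Visualization step: draw a green rectangle at the match and save it.
cv2.rectangle(screenshot, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("found.png", screenshot)
```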
Common at the time, before Flash took over.
"The Space Jam website is simple: a single HTML page, absolute positioning for every element, and a tiling starfield GIF background.".
This is not true: the site is built using tables, with no positioning at all; CSS wasn't a thing back then...
Here was its one-shot attempt at building the same type of layout (table based) with a screenshot and assets as input: https://i.imgur.com/fhdOLwP.png
The entire website is fairly small so this seems feasible.
Usually there’s a big difference between a website’s final code and its source code because of post-processing, but that seems like a totally solvable Claude problem.
Sure, LLMs aren’t great with images, but it’s not like the person who originally wrote the Space Jam website was meticulously messing around with positioning from a reference image to create a circular orbit — they just used the tools they had to create an acceptable result. Claude can do the same.
Perhaps the best method is to re-create, rather than replicate the design.
In 1996, we had only CSS1. Ask it to use tables to do this, perhaps.
Nonetheless, here is a link to a list of the specs you asked for: https://www.w3.org/Style/History/Overview.en.html
I tried your suggestion, and also tried giving it various more general versions of the limitations presented by earlier generations of web tech.
Claude's initial instinct was actually to limit itself to less modern web standards.
Unfortunately, nothing got those planets to be in the right place.