Maybe not but it sure would be funny.
But I believe the team at Anthropic looks for popular use cases like this one to improve their datasets. Same for every other big player in the LLM game.
1) give it text data from something that is annoying to copy and paste (eg labels off a chart or logs from a terrible web UI that doesn't make it easy to copy and paste).
2) give it screenshots of bugs, especially UI glitches.
It's extremely good at 1); I can't remember the last time it got it wrong.
On 2) it _really_ struggled until Opus 4.5, almost comically so: I'd post a screenshot and a description of the UI bug, and it would tell me "great it looks perfect! What next?"
With Opus 4.5 it's not quite as laughably bad, but it still often misses very obvious problems.
It's very interesting to see the rapid progression on these benchmarks, as it's probably a very good proxy for "agentic vision".
I've come to the conclusion that browser use without vision (e.g. based on the DOM or accessibility trees) is a dead end, simply because "modern" websites tend to use a comical amount of tokens to render. So if this gets very good (close to human level/speed), then we have basically solved agents being able to browse any website/GUI effectively.
falcor84•1h ago
skybrian•1h ago
kaashif•55m ago
Just continuously learn and have a super duper massive memory. Maybe I just need a bazillion GPUs to myself to get that.
But no-one wants to manage context all the time, it's incidental complexity.
falcor84•46m ago
falcor84•54m ago
And just to be clear, I'm mentioning this because I think that Claude Plays Pokemon is a playground for any agentic AI doing any sort of long-term independent work; I believe that the solution needed here is going to bring us closer to a fully independent agent in coding and other domains. It reminds me of the codeclash.ai benchmark, where similar issues are seen across multiple "rounds" of an AI working on the same codebase.