Maybe not, but it sure would be funny.
But I believe the team at Anthropic looks for popular use cases like this one to improve their datasets. Same for every other big player in the LLM game.
1) give it text data from something that is annoying to copy and paste (e.g. labels off a chart, or logs from a terrible web UI that makes copying hard).
2) give it screenshots of bugs, especially UI glitches.
It's extremely good at 1); I can't remember it ever getting it wrong.
On 2) it _really_ struggled until Opus 4.5, almost comically so: I'd post a screenshot and a description of the UI bug, and it would tell me "great, it looks perfect! What next?"
With Opus 4.5 it's not quite as laughably bad, but it still often misses very obvious problems.
It's very interesting to see the rapid progression on these benchmarks, as it's probably a very good proxy for "agentic vision".
I've come to the conclusion that browser use without vision (e.g. based on the DOM or accessibility tree) is a dead end, simply because "modern" websites take a comical number of tokens to represent that way. So if this gets very good (close to human level/speed), then we've basically solved agents being able to browse any website/GUI effectively.
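As a rough back-of-the-envelope sketch of that claim (assuming the common ~4 characters per token rule of thumb; the URL and numbers are placeholders, not a real measurement), this estimates how many context tokens a page's raw HTML would eat before you even build an accessibility tree:

    # Rough illustration of the "DOM token blowup" point above: fetch a page's
    # raw HTML and estimate how much of a model's context window it would occupy.
    # The ~4 chars/token ratio is a rule of thumb, not any specific tokenizer.
    import urllib.request

    def estimated_tokens(text: str, chars_per_token: float = 4.0) -> int:
        """Very rough token estimate; real tokenizers vary by model."""
        return int(len(text) / chars_per_token)

    def page_token_cost(url: str) -> int:
        """Download raw HTML and estimate its context-window cost."""
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        return estimated_tokens(html)

    if __name__ == "__main__":
        # A typical "modern" site's HTML easily runs to hundreds of KB,
        # i.e. tens of thousands of tokens per page load.
        print(page_token_cost("https://example.com"))

A screenshot, by contrast, costs a roughly fixed number of image tokens no matter how bloated the markup is, which is why vision-based browsing scales better here.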
/me clicks on the Twitch link, skips to a random time.
The screen shows a Weezing encounter, but the system mistook it for a Grimer.
Not sure if that's Claude or a bug in the glue code.
What are the actual differences?
"One thing I found fascinating about watching Claude play is it wouldn't play around and experiment the way I'd expect a human to? It would stand still still trying to work out what to do next, move one square up, consider a long time, move one square down, and repeat. When I'd expect a human to immediately get bored and go as far as they could in all directions to see what was there and try interacting with everything. Maybe some cognitive analogue of boredom is useful for avoiding loops?"
- FiftyTwo[0]
I'm wondering if this is a function of our training methods? They're sufficiently penalised against making "wrong moves" that they don't experiment?

[0]: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...
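FWIW, that "boredom" intuition maps onto a standard exploration trick. Here's a toy sketch (the class name, thresholds, and loop test are all made up for illustration, not how any real agent works) where the agent bumps up its exploration rate once its recent states collapse into a loop:

    # Toy sketch of "boredom as a loop-breaker": if the last N states are nearly
    # all the same, temporarily explore much more aggressively.
    import random
    from collections import deque

    class BoredExplorer:
        def __init__(self, actions, base_epsilon=0.05, bored_epsilon=0.5, window=20):
            self.actions = actions
            self.base_epsilon = base_epsilon    # normal exploration rate
            self.bored_epsilon = bored_epsilon  # exploration rate when "bored"
            self.recent_states = deque(maxlen=window)

        def is_bored(self) -> bool:
            # "Bored" = a full window of history with only one or two distinct states.
            full = len(self.recent_states) == self.recent_states.maxlen
            return full and len(set(self.recent_states)) <= 2

        def choose(self, state, best_action):
            self.recent_states.append(state)
            epsilon = self.bored_epsilon if self.is_bored() else self.base_epsilon
            if random.random() < epsilon:
                return random.choice(self.actions)  # experiment
            return best_action                      # stick with the current policy

An agent trained mainly to avoid "wrong moves" effectively keeps epsilon near zero, which would explain the one-square-up, one-square-down dithering.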
kaashif•3w ago
Just continuously learn and have a super duper massive memory. Maybe I just need a bazillion GPUs to myself to get that.
But no one wants to manage context all the time; it's incidental complexity.
onion2k•3w ago
Maybe we'll start needing to have daily stand-ups with our coding agents.
ben_w•3w ago
Even with humans, if a company is a car and the non-managers are the engine, meetings are the steering wheel and the mirror checks.
falcor84•3w ago
And just to be clear, I'm mentioning this because I think that Claude Plays Pokemon is a playground for any agentic AI doing any sort of long-term independent work; I believe that the solution needed here is going to bring us closer to a fully independent agent in coding and other domains. It reminds me of the codeclash.ai benchmark, where similar issues are seen across multiple "rounds" of an AI working on the same codebase.
stingraycharles•3w ago
It's exactly akin to a human who has to write everything down in notes and re-read them every time.
skerit•3w ago
I've had great success using Claude Opus 4.5, as long as I hold its hand very tightly.
Constantly updating the CLAUDE.md file, adding an FAQ to my prompts, making sure it remembers what it tried before and what the outcome was. It became a lot more productive after I started doing this.
Using the "main" agent as an orchestrator, and making it do any useful work or research in subagents, has also really helped to make useful sessions last much longer, because as soon as that context fills up you have to start over.
Compaction is fucking useless. It tries to condense roughly 160,000 tokens into a few thousand, and for anything a bit complex that won't work. So my "compaction" is very manual: I keep track of most of the things it has said during the session and what resulted from that. It reads a lot more like a transcript of the session, without _any_ of the actual tool call results. And this has worked surprisingly well.
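If you want to do the same thing, here's a minimal sketch of that kind of running "tried / outcome" log (the file name, entry format, and example content are my assumptions, not how the parent actually does it); the notes file gets pasted or referenced at the start of the next session instead of relying on automatic compaction:

    # Minimal sketch of a manual "compaction" log: record what was attempted and
    # what happened, skipping raw tool-call output entirely.
    from datetime import datetime
    from pathlib import Path

    NOTES_FILE = Path("SESSION_NOTES.md")  # hypothetical notes file

    def log_step(attempt: str, outcome: str) -> None:
        """Append one 'tried / outcome' entry to the session notes."""
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
        with NOTES_FILE.open("a", encoding="utf-8") as f:
            f.write(f"- [{stamp}] tried: {attempt}\n  outcome: {outcome}\n")

    # Example (invented) usage after a meaningful exchange:
    log_step("moved the date parsing into the service layer", "tests pass, UI bug gone")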
In the past I've tried various ways of automating this process, but it's never really turned out great. And none of the LLMs are good at writing _truly_ useful notes.
LeifCarrotson•3w ago
I'm pessimistic that future-paradigm AIs will change this anytime soon. Noosphere89 seems to think that future-paradigm AIs won't have these same limitations, but it seems obvious to me that the architecture of a GPT (the "P" standing for "Pre-trained") cannot "learn" after training, which is the fundamental problem with all these systems.