Insights into Claude Opus 4.5 from Pokémon

https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-into-claude-opus-4-5-from-pokemon

123•surprisetalk•1mo ago

Comments

falcor84•3w ago

The idea of Claude having "anterograde amnesia" and the top-rated comment there by Noosphere89 really resonated with me:

  "I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way."

  This is an extremely underrated comparison, TBH. Indeed, I'd argue that frozen weights + lack of a long-term memory are easily one of the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue).

  It emphasizes 2 things that are both true at once: LLMs do in fact reason like humans and can have (poor-quality) world-models, and there's no fundamental chasm between LLM capabilities and human capabilities that can't be cured by unlimited resources/time, and yet just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable/useful than future paradigm AIs.

skybrian•3w ago

For a coding agent, the project "learns" as you improve its onboarding docs (AGENTS.md), code, and tests. If you assume you're going to start a new conversation for each task and the LLM is a temp that's going to start from scratch, you'll have a better time.

kaashif•3w ago

Yeah but it feels terrible. I put as much as I can into Claude skills and CLAUDE.md but the fact that this is something I even have to think about makes me sad. The discrete points where the context gets compacted really feel bad and not like how I think AGI or whatever should work.

Just continuously learn and have a super duper massive memory. Maybe I just need a bazillion GPUs to myself to get that.

But no-one wants to manage context all the time, it's incidental complexity.

falcor84•3w ago

I agree with essentially everything you said, except for the final claim that managing context is incidental complexity. From what I know of cognitive science, I would argue that context management is a central facet of intelligence, and a lot of the success of humans in society is dependent on their ability to do so. Looking at it from the other side, executive function disorders such as ADHD offer significant challenges for many humans, and they seem to be not quite entirely unlike these context issues that Claude faces.

onion2k•3w ago

no-one wants to manage context all the time

Maybe we'll start needing to have daily stand-ups with our coding agents.

ben_w•3w ago

Already should be. Though given the speed difference, when every equivalent of a human-day of work has been done, rather than every 24 hours of wall clock time.

Even with humans, if a company is a car and the non-managers are the engine, meetings are the steering wheel and the mirror checks.

falcor84•3w ago

But that's the thing: Claude Plays Pokemon is an experiment in having Claude work fully independently, so there's no "you" who would improve its onboarding docs or anything else, it has to do so on its own. And as long as it cannot do so reliably, it effectively has anterograde amnesia.

And just to be clear, I'm mentioning this because I think that Claude Plays Pokemon is a playground for any agentic AI doing any sort of long-term independent work; I believe that the solution needed here is going to bring us closer to a fully independent agent in coding and other domains. It reminds me of the codeclash.ai benchmark, where similar issues are seen across multiple "rounds" of an AI working on the same codebase.

skybrian•3w ago

Sure, it's not close to fully independent. But I was interpreting "much, much less employable" as not very useful for programming in its current state, and I think it is quite useful.

vidarh•3w ago

No, but it can produce the onboarding docs itself with some "bootstrap" prompting. E.g. give it a scratchpad to write its own notes in, and direct it to use it liberally. Give it a persistent todo list, and direct it to use it liberally. Tell it to keep a work log. Tell it to commit early and often - you can squash things later, and Claude is very good at navigating git logs.

stingraycharles•3w ago

But these docs are the notes, it constantly needs to be re-primed with them, an approach which doesn’t scale. How much of this knowledge can you really put in these agent docs? There's only so much you can do, and for any serious-scale projects, there's SO much knowledge that needs to be captured. Not just "do this, do that", but also context about why certain decisions were made (rationale, business context, etc).

It is exactly akin to a human that has to write down everything on notes, and re-read them every time.

solumunus•3w ago

They don’t need to re read them all every time though, they revise the relevant context for a particular task, exactly as a human would need to do when revisiting an area of the application that they have no recent exposure to… If you’re putting everything into one master MD file in root you’re going very wrong.

skerit•3w ago

I agree, I started doing something like that a while ago.

I've had great success using Claude Opus 4.5, as long as I hold its hand very tightly.

Constantly updating the CLAUDE.md file, adding an FAQ to my prompts, making sure it remembers what it tried before and what the outcome was. It became a lot more productive after I started doing this.

Using the "main" agent as an orchestrator, and making it do any useful work or research in subagents, has also really helped to make useful sessions last much longer, because as soon as that context fills up you have to start over.

Compaction is fucking useless. It tries to condense +/- 160.000 tokens into a few thousand tokens, and for anything a bit complex this won't work. So my "compaction" is very manual: I keep track of most of the things it has said during the session and what resulted from that. So it reads a lot more like a transcript of the session, without _any_ of the actual tool call results. And this has worked surprisingly well.

In the past I've tried various ways of automating this process, but it's never really turned out great. And none of the LLMs are good at writing _truly_ useful notes.

formerly_proven•3w ago

The way amp does this explicitly with threads and hand-offs (and of course the capability to summarize/fetch parts of other threads on demand as opposed to eagerly, like compaction essentially tries to do) makes imho a ton of sense for the way LLMs currently work. "Infinite scroll but not actually" is an inferior approach. I'm surprised others aren't replicating this approach; it's easy to understand, simple to implement and works well.

PunchyHamster•3w ago

mfw people do better documentation for AI than for other people in the project

LeifCarrotson•3w ago

It's really easy to anthropomorphize an LLM or computer program, but they're fundamentally alien systems.

I'm pessimistic that future paradigm AIs will change this anytime soon - it appears that Noosphere89 seems to think that future paradigm AIs will not have these same limitations, but it seems obvious to me that the architecture of a GPT (the "P" standing for "Pre-trained") cannot "learn," which is the fundamental problem with all these systems.

oceansky•3w ago

I wonder if there's someone at Antrophic working to fine-tune the model's pokemon playing ability specifically.

Maybe not but it sure would be funny.

dbish•3w ago

If I recall correctly, a prior interview about claude plays pokemon stated they purposely chose pokemon as a use case that was not meant to be trained/finetuned on. That's what makes it an interesting problem, so hopefully they aren't.

oceansky•3w ago

I believe the testing itself is done in very good faith.

But I believe the team at Antrophic looks for popular use cases like this one to improve their datasets. Same for every other big player in the LLM game.

martinald•3w ago

This actually matches my experience quite well. I use vision (often) to try and do 2 main things in Claude code:

1) give it text data from something that is annoying to copy and paste (eg labels off a chart or logs from a terrible web UI that doesn't make it easy to copy and paste).

2) give it screenshots of bugs, especially UI glitches.

It's extremely good at 1), can't remember when it got it wrong.

On 2) it _really_ struggled until opus 4.5, almost comically so, with me posting a screenshot and a description of the UI bug and it telling me "great it looks perfect! What next?"

With opus 4.5 it's not quite laughably as bad but still often misses very obvious problems.

It's very interesting to see the rapid progression on these benchmarks, as it's probably a very good proxy for "agentic vision".

I've came to the conclusion that browser use without vision (eg based on the DOM or accessibility trees) is a dead end, simply because "modern" websites tend to use a comical amount of tokens to render. So if this gets very good (close to human level/speed) then we have basically solved agents being able to browse any website/GUI effectively.

j16sdiz•3w ago

> You may be surprised to learn that ClaudePlaysPokemon is still running today

/me click on the twitch link, skip to a random time.

The screen shows a Weezing encounter, the system mistook it as Grimer.

Not sure that's Claude, or bug in the glue code

eab-•3w ago

I find it surprising that GPT managed to complete (with indeed a harness that doesn't seem very powerful!) and Claude still cannot.

awakeasleep•3w ago

Don’t they give got a minimap?

What are they actual differences?

Folcon•3w ago

   "One thing I found fascinating about watching Claude play is it wouldn't play around and experiment the way I'd expect a human to? It would stand still still trying to work out what to do next, move one square up, consider a long time, move one square down, and repeat. When I'd expect a human to immediately get bored and go as far as they could in all directions to see what was there and try interacting with everything. Maybe some cognitive analogue of boredom is useful for avoiding loops?"
    - FiftyTwo[0]

I'm wondering if this is function of our training methods? They're sufficiently penalised against making "wrong moves", that they don't experiment?

-[0]: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...

SectorC: A C Compiler in 512 bytes

The F Word

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Speed up responses with fast mode

Software factories and the agentic moment

Stories from 25 Years of Software Development

Hoot: Scheme on WebAssembly

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

I write games in C (yes, C)

First Proof

Show HN: A luma dependent chroma compression algorithm (image compression)

The Waymo World Model

Al Lowe on model trains, funny deaths and working with Disney

Vocal Guide – belt sing without killing yourself

Start all of your commands with a comma (2009)

Reinforcement Learning from Human Feedback

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Selection Rather Than Prediction

Coding agents have replaced every framework I used

The AI boom is causing shortages everywhere else

A Fresh Look at IBM 3270 Information Display System

France's homegrown open source online office suite

72M Points of Interest

We mourn our craft

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Learning from context is harder than we thought

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

History and Timeline of the Proco Rat Pedal (2021)

SectorC: A C Compiler in 512 bytes

The F Word

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Speed up responses with fast mode

Software factories and the agentic moment

Stories from 25 Years of Software Development

Hoot: Scheme on WebAssembly

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

I write games in C (yes, C)

First Proof

Show HN: A luma dependent chroma compression algorithm (image compression)

The Waymo World Model

Al Lowe on model trains, funny deaths and working with Disney

Vocal Guide – belt sing without killing yourself

Start all of your commands with a comma (2009)

Reinforcement Learning from Human Feedback

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Selection Rather Than Prediction

Coding agents have replaced every framework I used

The AI boom is causing shortages everywhere else

A Fresh Look at IBM 3270 Information Display System

France's homegrown open source online office suite

72M Points of Interest

We mourn our craft

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Learning from context is harder than we thought

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

History and Timeline of the Proco Rat Pedal (2021)

Insights into Claude Opus 4.5 from Pokémon

Comments