Letting Claude play text adventures

https://borretti.me/article/letting-claude-play-text-adventures

154•varjag•3w ago

Comments

skybrian•2w ago

It seems like asking Claude to keep notes somehow would work better. An AGENTS file and a TODO file? An issue tracker like beads? Lots of things to try.

pflenker•2w ago

For a game like Anchorhead, which is famous in its niche, shouldn’t Claude already know it well enough to just solve it right away? I would expect that its training data contained multiple discussions and walkthroughs of the game.

ratg13•2w ago

It's very likely the model didn't stop to question if the game they were playing was something they knew already, and just assumed it was a puzzle created for it.

sfjailbird•2w ago

You can see Claude's responses in the repo. The first one is:

Ah, Anchorhead! One of the most celebrated pieces of interactive fiction ever written

vunderba•2w ago

I would think so. I'd be far more interested in a comparison of LLMs (no internet search allowed) playing against IF games released in the past month.

Jweb_Guru•2w ago

Yeah, I do not find performances like this very impressive.

IgorPartola•2w ago

Honestly I am curious how it would do if it did have a walkthrough.

zetalyrae•2w ago

I expect it's somewhere in the training data, but it's very unlikely to be salient. A few textfiles here and there in the ocean of the Internet is nothing. If Claude had memorized the walkthrough, it would have performed better.

brianjeong•2w ago

You could say the same about Pokemon - the models still struggle quite a bit.

imiric•2w ago

> By the time you get to day two, each turn costs tens of thousands of input tokens

This behavior surprised me when I started using LLMs, since it's so counterintuitive.

Why does every interaction require submitting and processing all data in the current session up until that point? Surely there must be a way for the context to be stored server-side, and referenced and augmented by each subsequent interaction. Could this data be compressed in a way to keep the most important bits, and garbage collect everything else? Could there be different compression techniques depending on the type of conversation? Similar to the domain-specific memories and episodic memory mentioned in the article. Could "snapshots" be supported, so that the user can explore branching paths in the session history? Some of this is possible by manually managing context, but it's too cumbersome.

Why are all these relatively simple engineering problems still unsolved?

iamjackg•2w ago

It's not unsolved, at least not the first part of your question. In fact it is a feature offered by all main LLM providers!

- https://platform.openai.com/docs/guides/prompt-caching

- https://platform.claude.com/docs/en/build-with-claude/prompt...

- https://ai.google.dev/gemini-api/docs/caching

imiric•2w ago

Ah, that's good to know, thanks.

But then why is there compounding token usage in the article's trivial solution? Is it just a matter of using the cache correctly?

StevenWaterman•2w ago

Cached tokens are cheaper (90% discount ish) but not free
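To put rough numbers on why usage still compounds with caching: if every turn resends the whole history, billed input grows roughly quadratically with the number of turns, and a cache discount shrinks the constant but not the shape. A minimal sketch, assuming a fixed number of new tokens per turn and a flat 90% discount on the cached prefix (both simplifications; real pricing also charges for cache writes, which is ignored here). The function names are hypothetical.

```python
def full_price_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens billed when each turn resends the entire history uncached."""
    # Turn k sends k * tokens_per_turn tokens, so the total is an arithmetic series.
    return tokens_per_turn * turns * (turns + 1) // 2


def discounted_tokens(turns: int, tokens_per_turn: int, discount: float = 0.9) -> float:
    """Full-price-equivalent tokens when the earlier history is read from cache.

    Assumption: only the new turn is billed at full price; everything before it
    is billed at (1 - discount) of full price.
    """
    total = 0.0
    for turn in range(1, turns + 1):
        cached_prefix = (turn - 1) * tokens_per_turn
        total += tokens_per_turn + cached_prefix * (1.0 - discount)
    return total


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, full_price_tokens(n, 500), round(discounted_tokens(n, 500)))
```

Under these assumptions the cached total is still a quadratic in the number of turns, just with a smaller coefficient on the quadratic term.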

moyix•2w ago

Also, unlike OpenAI, Anthropic's prompt caching is explicit (you set up to 4 cache "breakpoints"), meaning if you don't implement caching then you don't benefit from it.
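As a rough illustration of what "explicit" means here, the sketch below builds a request body in which the system prompt and the last message carry a `cache_control` marker. The helper names (`with_cache_breakpoint`, `build_request`) are hypothetical and the model name is left as a parameter; the field layout follows my understanding of Anthropic's prompt caching documentation linked above and should be checked against it before use. Nothing here calls the API.

```python
import copy
from typing import Any


def with_cache_breakpoint(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of messages with a cache breakpoint on the final content block."""
    marked = copy.deepcopy(messages)  # leave the caller's history untouched
    last = marked[-1]
    if isinstance(last["content"], str):
        # Plain-string content must become a block list to carry the marker.
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


def build_request(model: str, system_prompt: str,
                  messages: list[dict[str, Any]], max_tokens: int = 1024) -> dict[str, Any]:
    """Assemble a request body with two breakpoints: system prompt and latest turn."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": with_cache_breakpoint(messages),
    }
```

Moving the breakpoint to the newest message on every turn is one common pattern: the prefix up to the previous breakpoint can then be read from cache, and only the new turn is processed at full price.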

netcraft•2w ago

That's a very generous way of putting it. Anthropic's prompt caching is actively hostile and very difficult to implement properly.

igravious•2w ago

dumb question, but is prompt caching available to Claude Code … ?

stavros•2w ago

If you're using the API, yes. If you have a subscription, you don't care, as you aren't billed per prompt (you just have a limit).

sfjailbird•2w ago

Cool! I would like to see the game sessions.

Edit: they are there in the repo: https://github.com/eudoxia0/claude-plays-anchorhead/tree/mas...

tiahura•2w ago

Claude Code, NetHack, and tmux are fun to experiment with.

brimtown•2w ago

I’m currently letting Claude build and play its own Dwarf Fortress clone, as an installable plugin in Claude Code

https://github.com/brimtown/claude-fortress

twohearted•2w ago

This is a great idea and great work.

Context is intuitively important, but people rarely put themselves in the LLM's shoes.

What would be eye-opening would be to create an LLM test system that periodically sends a turn to a human instead of the model. Would you do better than the LLM? What tools would you call at that moment, given only that context and no other knowledge? The way many of these systems are constructed, I'd wager it would be difficult for a human.

The agent can't decide what is safe to delete from memory because it's a sort of bystander at that moment. Someone else made the list it received, and someone else will get the list it writes. The logic that went into why the notes exist is lost. LLMs are living the Christopher Nolan film Memento.

fragmede•2w ago

The canonical example I use is how good are (philosophical) you at programming on a whiteboard given one shot and no tools? Vs at your computer given access to everything? So judging LLMs on that rubric seems as dumb as judging humans by that rubric.

lukev•2w ago

This is a great framework to experiment with memory architectures.

Everything the author says about memory management tracks with my intuition of how CC works, including my perception that it isn't very good at explicitly managing its own memory.

My next step in trying to get it to work well on a bigger game would be to try to build a more "intuitive" memory tool, where the textual description of a room or an item would automatically RAG previous interactions with that entity into context.

That also is closer to how human memory works -- we're instantly reminded of things via a glimpse, a sound, a smell... we don't need to (analogously) write in or search our notebook for basic info we already know about the world.
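One very small way to prototype the entity-triggered memory described above: store notes under entity names and surface them whenever a new room or item description mentions that entity. This is only a sketch; plain substring matching stands in for real retrieval (no embeddings), and the `EntityMemory` class and its methods are hypothetical names, not anything from the article's repo.

```python
from collections import defaultdict


class EntityMemory:
    """Notes keyed by entity name, recalled when a description mentions the entity."""

    def __init__(self) -> None:
        self._notes: dict[str, list[str]] = defaultdict(list)

    def remember(self, entity: str, note: str) -> None:
        """Attach a note to an entity (stored case-insensitively)."""
        self._notes[entity.lower()].append(note)

    def recall(self, description: str) -> list[str]:
        """Return every stored note whose entity name appears in the description."""
        text = description.lower()
        hits: list[str] = []
        for entity, notes in self._notes.items():
            if entity in text:
                hits.extend(f"[{entity}] {note}" for note in notes)
        return hits


if __name__ == "__main__":
    memory = EntityMemory()
    memory.remember("rusty lock", "Oil from the fish tin did not open it.")
    # The harness would prepend these hits to the model's context each turn.
    print(memory.recall("A heavy door secured by a rusty lock."))
```

The point of the design is that the harness, not the model, decides when to look things up, so recall does not depend on the model remembering to consult its notes.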

daxfohl•2w ago

I tried something similar, but distilled to "solve this maze" as a first-person text adventure, and while it usually solved it eventually, it almost always backtracked through fully-explored dead ends multiple times before finally getting to the end. I was pretty surprised by this, as I expected they'd be able to traverse more or less optimally most of the time.

I tried basic raw long-context chat, various approaches of getting it to externalize the state (i.e. prompting it to emit the known state of the maze after each move, but not telling it exactly what to emit or how to format it), and even allowing it to emit code to execute after each turn (so long as it was a serialization/storage algorithm, not a solver in itself), but it invariably would get lost at some point. (It always neglected to emit a key for which coordinate was which, and which direction was increasing. Even if I explicitly told it to do this, it would frequently forget to at some point anyway and get turned around again. If I explicitly provided the key each move, it would usually work).

Of course it had no problem writing an optimal algorithm to solve mazes when prompted. In fact it basically wrote itself; I have no idea how to write a maze generator. I thought the disparity was interesting.

Note the mazes had the start and end positions inside the maze itself, so they weren't trivially solvable by the "follow wall to the left" algorithm.

This was last summer so maybe newer models would do better. I also stopped due to cost.
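The missing coordinate key described above is the kind of thing a harness could supply mechanically on every turn instead of trusting the model to restate it. A minimal sketch, assuming a grid maze with four compass moves; the `MazeNotes` class, the `KEY` wording, and the axis convention are all made up for illustration. It stores state only and does no solving.

```python
# Assumed convention: x grows to the east, y grows to the south.
DELTAS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}
KEY = "Coordinates are (x, y); x increases to the east, y increases to the south."


class MazeNotes:
    """Tracks position, visited cells, and known walls for a first-person maze."""

    def __init__(self) -> None:
        self.position = (0, 0)
        self.visited = {(0, 0)}
        self.walls: set[tuple[tuple[int, int], str]] = set()

    def record_move(self, direction: str, succeeded: bool) -> None:
        """Update state after attempting a move in the given direction."""
        if not succeeded:
            self.walls.add((self.position, direction))
            return
        dx, dy = DELTAS[direction]
        x, y = self.position
        self.position = (x + dx, y + dy)
        self.visited.add(self.position)

    def unexplored_exits(self) -> list[str]:
        """Directions from here that are not known walls and lead to unvisited cells."""
        x, y = self.position
        return [
            d for d, (dx, dy) in DELTAS.items()
            if (self.position, d) not in self.walls
            and (x + dx, y + dy) not in self.visited
        ]

    def render(self) -> str:
        """Text block to prepend to each prompt; always starts with the key."""
        return "\n".join([
            KEY,
            f"Position: {self.position}",
            f"Visited: {sorted(self.visited)}",
            f"Untried exits here: {self.unexplored_exits()}",
        ])
```

Because `render()` always leads with the key, the orientation convention cannot silently drop out of context halfway through a run.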

sfjailbird•2w ago

Having read through the entire game session, Claude plays the game admirably! For example, it finds a random tin of oily fish somewhere, and later tries (unsuccessfully) to use it to oil a rusty lock. Later it successfully solves a puzzle inside the house by thoroughly examining random furniture and picking up subtle clues about what to do, based on it.

It did so well that I can't not suspect that it used some hints or walkthroughs, but then again it did a bunch of clueless stuff too, like any player new to the game.

For one thing, this would be a great testing tool for the author of such a game. And more generally, the world of software testing is probably about to take some big leaps forward.

macNchz•2w ago

As a fan of text adventures who has played many over the years—Anchorhead is hard. It was kind of a white whale for me over many years until I finally beat it during the pandemic lockdown.

suzzer99•2w ago

How does it compare in difficulty and scope to the original Adventure? I guess actually known as Colossal Cave Adventure? When I played it on my uncle's terminal in the 70s it was just called Adventure.

I stayed up all night and didn't get very far. I finally saw a solution online and I wasn't even close.

woggy•2w ago

Very interesting, seems like a good framework to test and experiment with memory. I am curious why it wasn't able to solve it, considering it is a well-known game. It would be interesting if puzzle games like this could be generated, so we know the model hasn't already been trained on them.

I wonder if the improvements due to different memory system approaches apply in a similar way to tasks that are in its training history vs those that are not.

justinclift•2w ago

This would be interesting to try with local models, where the token costs and token limits are quite different.

nitwit005•2w ago

I see people have put the transcripts of full adventure game playthroughs online, so it's reasonably likely games are present in the training data: https://dwheeler.com/anchorhead/anchorhead-transcript.txt

You can probably find games where that's not true, as people are still releasing text adventure games occasionally.

s-macke•2w ago

To compensate for this I have put 100 or more terrible text adventure game runs from AI online [0]. And with this post I even link to them directly for the next web crawler.

[0] https://github.com/s-macke/AdventureAI/tree/master/storydump

PaulHoule•2w ago

It’s trained to interact with text transcripts, it is not trained to work with that memory you built for it. If it was trained to do so I might be able to break into the real estate office in ten turns.

kaiokendev•2w ago

One thing I had fun doing last year was having Claude parse some gamebook PDFs I got on archive.org, split them out into sections, and build a wrapper for presenting the sections with possible choices and just watching it play through the books by itself. You can do this with some D&D adventures as well, Claude Code has gotten good enough to run ToEE pretty well.

diamond559•2w ago

Great, we can burn acres of dead forests so that my computer can play ddos games. What an exciting future!

goodmythical•2w ago

What else are we going to do with them? Carve them into housing to house more humans that produce more carbon?

Leave them to rot?

Wouldn't it be best to clearcut a dead forest to allow more plants to grow to increase carbon capture?

jryle70•2w ago

How much energy is burnt so that you can play your video games, or whatever hobbies you have?

apsdsm•2w ago

Probably less than the cost of an LLM doing exactly the same thing?

BoredomIsFun•2w ago

Not sure about that - a local 12b-32b LLM consumes a minuscule amount of energy compared to gaming on the same hardware.

vunderba•2w ago

Using AI to drive text adventures / rogues has been pretty popular for a while now - I remember seeing a pretty dismal performance (although it was over a year ago) where somebody was trying to use an LLM to drive a game of Zork.
