However, I must point out that the kind of "modern" (relatively speaking) adventure games mentioned in the article -- which are more accurately called "interactive fiction" by the community -- is not very suitable for this kind of experiment. Why? Because so many of them are exploratory/experimental, and not at all about "winning" (unlike, say, "Colossal Cave Adventure", where there is a clear goal).
You cannot automate (via LLM) "playing" them, because they are all about the thoughts and emotions (and maybe shocked laughter) they elicit in human players. This cannot be automated.
If you think I'm being snobby, consider this: the first game TFA mentions is "9:05". Now, you can set goals for a bot to play this game, but truly -- if you've played the game -- you know this would be completely missing the point. You cannot "win" this game, it's all about subverting expectations, and about replaying it once you've seen the first, most straightforward ending, and having a laugh about it.
Saying more will spoil the game :)
(And do note there's no such thing as "spoiling a game" for an LLM, which is precisely the reason they cannot truly "play" these games!)
I'll grant you that 9:05 and For a Change are somewhat more modern: the former has easy puzzles, the latter very abstract puzzles.
I disagree that new text adventures are not about puzzles and winning. They come in all kinds of flavours these days. Even games like 9:05 pace their narrative with traditional puzzles, meaning we can measure forward progress just the same. And to be fair, LLMs are so bad at these games that in these articles I'm merely trying to get them to navigate the world at all.
If anything, I'd argue Adventure is a bad example of the genre you refer to. It was (by design) more of a caving simulator/sandbox with optional loot than a game with progress toward a goal.
The idea was that it'd be a good example of having to navigate somewhat foreign but internally consistent worlds, an essential text adventure skill.
The audience I had in mind when writing it was people who were already quite experienced in playing interactive fiction and could then be challenged in a new way while bringing their old skills to bear. So it's sort of a second-level game in that respect (so is 9:05, in different ways, as someone else mentioned).
I didn't use Adventure as an example of IF; it belongs in the older "text adventure" genre, which is why I thought it would be more fitting for testing LLMs, since it's not about experiences but about maxing points.
I think there's nothing about IF that an LLM can "solve". This genre of games, in its modern expression, is about breaking boundaries and expectations, and making the player enjoy this. Sometimes the fun is simply seeing different endings and how they relate to each other. Since LLMs cannot experience joy or surprise, and can only mechanically navigate the game (maybe "explore all possible end states" is a goal?), they cannot "play" it. Before you object: I'm aware you didn't claim the LLMs are really playing the game!
But here's a test for your set of LLMs: how would they "win" at "Rematch"? This game is about repeatedly dying, understanding what's happening, and stringing together a single sentence that will break the cycle and win the game. Can any LLM do this, a straightforward puzzle? I'd be impressed!
As for the specific question: they would progress at Rematch by figuring out ever more complicated interactions that work, which can then be used to survive.
This seems like begging the question to me.
I don't think there's a mechanistic (as in "token predictor") procedure to generate the emotions of having fun, or being surprised, or amazed. It's not on me to demonstrate it cannot be done, it's on them to demonstrate it can.
But to be clear, I don't think the author of TFA is making this claim either. They are simply approaching IF games from a "problem solving" perspective -- they don't claim this has anything to do with fun or AGI -- and what I'm arguing is that this mechanistic approach to IF games, i.e. "problem solving", only touches on a small subset of what makes people want to play these games. They are often (not all, as the author rightly corrects me, but often) about generating surprise and amazement in the player, something that cannot be done to an LLM.
(Note I'm also not dismissing the author's experiment. As an experiment it's interesting and, I'd argue, fun for the author).
Current state-of-the-art LLMs cannot feel amazement, or anything else really (and, I argue, no LLM in the current tech branch ever will). I hope this isn't a controversial statement.
The purpose of the test is whatever the tester decides it is. If that means finding X% of the ambiguously-good game endings within a budget of Y commands, then so be it.
Well, I did say:
> As an experiment, I cannot argue with this.
It was more a reflection on the fact that the primary goal of a lot of modern IF games, among which is "9:05", the first game mentioned in TFA, is not like "traversing a mountain". Traversing a mountain can have clear and meaningful goals, such as "reach the summit", "avoid getting stuck", or "do not die or go missing after X hours". Though of course, appreciating nature and sightseeing is beyond the scope of an LLM.
Indeed, "9:05" has no other "goal" than, upon seeing a different ending from the main one, revisiting the game with the knowledge gained from that first playthrough. I'm being purposefully opaque in order not to spoil the game for you (you should play it, it's really short).
Let me put it another way: remember that fad, some years ago, of making you pay attention to an image or video, with a prompt like "colorblind people cannot see this shape after X seconds" so you pay attention and then BAM! A jump scare! Haha, joke's on you!
How would you "test" a LLM on such jump scare? The goal is to scare a human. LLMs cannot be scared. What would the possible answers be?
A: I do not see any disappearing shapes after X seconds. Beep boop! I must not be colorblind, nor human, for I am an LLM. Beep!
or maybe
B: This is a well-known joke. Beep boop! After some short time, a monster appears on screen. This is intended to scare the person looking at it! Beep!
Would you say either response would show the LLM "playing" the game?
(Trust me, this is a somewhat adjacent effect to what "9:05" would play on you, and I fear I've said too much!)
And of course, there's no actual reasoning or logic going on, so they cannot compete in this context with a curious 12 year old, either.
If an LLM responds to “You are standing in an open field west of a white house” with “okay, I’m going to walk up to the house”, and just gets back “THAT SENTENCE ISN'T ONE I RECOGNIZE”, it’s not going to make much progress.
It is difficult here to separate out how much of this could be fixed or improved by better prompting. A better baseline might be to just give the LLM direct access to the text adventure, so that everything the LLM replies is given to the game directly. I suspect that the LLMs would do poorly on this task, but would undoubtedly improve over time and generations.
EDIT: Just started playing 9:05 with GPT-4 with no prompting and it did quite poorly; kept trying to explain to me what was going on with the ever more complex errors it would get. Put in a one line "You are playing a text adventure game" and off it went -- it took a shower and got dressed and drove to work.
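For anyone who wants to try the same thing end to end, here's a minimal sketch of that "direct access" setup: run the game through a Z-machine interpreter and feed the LLM's replies straight back in. It assumes dfrotz on the PATH, a local 9:05 story file (the filename is a placeholder), and the OpenAI Python SDK; the ">" prompt detection is a crude heuristic, not anything robust.

    import subprocess
    from openai import OpenAI  # assumes the official OpenAI Python SDK

    client = OpenAI()

    # Launch a Z-machine interpreter on the story file (filename is a placeholder).
    game = subprocess.Popen(
        ["dfrotz", "905.z5"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, bufsize=1,
    )

    def read_until_prompt():
        # Crude heuristic: read until the interpreter prints its ">" prompt.
        out = []
        while True:
            ch = game.stdout.read(1)
            if not ch:
                break
            out.append(ch)
            if "".join(out).rstrip().endswith(">"):
                break
        return "".join(out)

    history = [{"role": "system",
                "content": "You are playing a text adventure game. "
                           "Reply with exactly one game command and nothing else."}]

    for _ in range(50):  # command budget
        history.append({"role": "user", "content": read_until_prompt()})
        reply = client.chat.completions.create(model="gpt-4", messages=history)
        command = reply.choices[0].message.content.strip()
        history.append({"role": "assistant", "content": command})
        game.stdin.write(command + "\n")  # everything the LLM says goes to the game
        game.stdin.flush()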
[0] "Thinking, Fast and Slow" https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow
This is exactly why something like ARC-AGI-3 feels so important right now. Instead of static benchmarks that these models can basically brute-force with enough training data, it's designed around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions. That shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.
What's clever about the game-environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely. When humans encounter a new game, we explore, form plans, remember what worked, and adjust our strategy: all the interactive reasoning over time that these text adventure results show LLMs are terrible at. We need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.
It's interesting he wrote a separate program to wrap the z-machine interpreter. I integrated my wrapper directly into my pytorch training program.
Otherwise, how can you determine when "north" is a context change? It isn't always one: sometimes the move puts you in a new room, and sometimes you just walk into a wall.
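I don't know what your wrapper does, but as a sketch of the kind of bookkeeping involved: count "north" as a context change only when the game actually reports a new location, e.g. by spotting the room-name header most Z-machine games print. The regex here is a rough heuristic that would need per-game tuning.

    import re

    # Heuristic: many Z-machine games print the room name as a short
    # title-case line by itself when you enter a new location.
    ROOM_HEADER = re.compile(r"^[A-Z][A-Za-z' -]{2,40}$")

    class ContextTracker:
        def __init__(self):
            self.current_room = None

        def moved(self, game_output: str) -> bool:
            """Return True only if this output indicates a new room,
            so 'north' into a wall is not counted as a context change."""
            for line in game_output.splitlines():
                line = line.strip()
                if ROOM_HEADER.match(line):
                    if line != self.current_room:
                        self.current_room = line
                        return True
                    return False
            return False  # no room header: the move was rejected or a no-op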
Part of what distinguishes humans from artificial "intelligence" to me is exactly that we automatically develop models of whatever is needed.
> real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out
It is insane to me that so many people believe intelligence is measurable by pure question-answer testing. There's hundreds of years of discussion about how limited this is even for measuring human intelligence. I'm sure we all know someone who's a really good test taker but who you wouldn't consider especially bright, and I'm sure every one of us also knows someone in the other camp (bad at tests but clearly bright). The definition you put down is much closer to what's agreed upon in the scientific literature. While we don't have a good formal definition of intelligence, there's a difference between lacking a formal definition and having no definition at all. I really do hope people read more about intelligence and how we measure it in humans and animals. It's very messy and there's a lot of noise, but at least we have a good idea of the directions to move in. There are still nuances to be learned, and while I think ARC is an important test, I don't think success on it will prove AGI (and Chollet says this too).
I find it funny that some AIs score very well on ARC-AGI but fail at these games...
So from this it seems that not only would many of these requests not touch a reasoning model (or as it works now, have reasoning set to "minimal"?), but they're probably being routed to a mini or nano model?
It would make more sense, I think, to test on gpt-5 itself (and ideally the -mini and -nano as well), and perhaps with different reasoning effort, because that makes a big difference in many evals.
EDIT: Yeah the Chat router is busted big time. It fails to apply thinking even for problems that obviously call for it (analyzing financial reports). You have to add "Think hard." to the end of the prompt, or explicitly switch to the Thinking model in the UI.
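For what it's worth, pinning the model and reasoning effort explicitly avoids the router entirely. A sketch, assuming the Responses API's reasoning-effort parameter as currently documented for GPT-5 (the model names and the toy transcript are placeholders):

    from openai import OpenAI

    client = OpenAI()
    transcript = ("West of House\n"
                  "You are standing in an open field west of a white house.")

    # Sweep models and reasoning budgets over the same game state.
    for model in ("gpt-5", "gpt-5-mini", "gpt-5-nano"):
        for effort in ("minimal", "low", "medium", "high"):
            resp = client.responses.create(
                model=model,
                reasoning={"effort": effort},  # bypasses any chat-side routing
                input="You are playing a text adventure game.\n\n" + transcript,
            )
            print(model, effort, "->", resp.output_text)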
I don't really get this gripe? It seems no different than before, except now it will sometimes opt into thinking harder by itself. If you know you want CoT reasoning you just select gpt5-thinking, no different than choosing o4-mini/o3 like before.
I've found for text adventures based on item manipulation, variations of the same puzzles appear again and again because there's a limit to how many obscure but not too obscure item puzzles you can come up with, so training would be good for exact matches of the same puzzle, and variations, like different ways of opening locked doors.
Puzzles like key + door, crowbar + panel, dog + food, coin + vending machine, vampire + garlic, etc. You can obscure or layer puzzles, like changing the garlic into garlic bread, which would still work on the vampire, so there are logical connections to make but often nothing too crazy.
A lot of the difficulty in these games comes from not noticing or forgetting about clues/hints and potential puzzles because there's so much going on, which is less likely to trip up a computer.
You can already ask LLMs "in a game: 20 ways to open a door if I don't have the key", "how to get past an angry guard dog" or "I'm carrying X, Y, and Z, how do I open a door", and it'll list lots of ways that are seen in games, so it's going to be good at matching that with the current list of objects you're carrying, items in the world, and so on.
Another comment mentions how the AI needs a world model that transforms as actions are performed, but you need something similar to reason about maths proofs and code, where you have to keep track of the current state/context. And most adventure games don't require you to plan many steps in advance anyway. They're often about figuring out which item to combine/use with which other item next (where only one combination works), and navigating to the room that contains the latter item first.
So it feels like most of the parts are already there to me, and it's more about getting the right prompts and presenting the world in the right format, e.g. maintaining a table of items, clues, and open puzzles to look for connections and matches, and maintaining a map.
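Something like this, as a rough sketch (the field names and prompt format are my own invention, not anything the article uses):

    from dataclasses import dataclass, field

    @dataclass
    class WorldState:
        """Explicit bookkeeping fed back into the prompt every turn."""
        inventory: set[str] = field(default_factory=set)
        items_seen: dict[str, str] = field(default_factory=dict)   # item -> room it was seen in
        open_puzzles: list[str] = field(default_factory=list)      # e.g. "locked door in the hall"
        game_map: dict[str, dict[str, str]] = field(default_factory=dict)  # room -> {direction: room}

        def as_prompt(self) -> str:
            return (
                f"Inventory: {sorted(self.inventory)}\n"
                f"Items seen elsewhere: {self.items_seen}\n"
                f"Unsolved puzzles: {self.open_puzzles}\n"
                f"Map so far: {self.game_map}\n"
                "Suggest one command, matching carried items against the open puzzles."
            )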
Getting LLMs to get good at variations of The Witness would be interesting, where the rules have to be learned through trial and error, and combined.
They just can't seem to grasp what would make a choice a "wrong" choice in a text-based adventure game, so they end up having no ending. You have to hard-code failure events, or you just never get anything like "you chose to attack the wizard, but he's level 99, dummy, so you died - game over!". It just accepts whatever choice you make, ad infinitum.
My best session was one in which I had the AI give me 4 dialogue options to choose from. I never "beat" the game, and we never solved the mystery - it just kept going further down the rabbit hole. But it was surprisingly enjoyable, and replayable! A larger framework just needs to be written for it to keep the tires between the lines and to hard-code certain game rules - what's under the hood is already quite good for narratives imo.
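For example, a thin rules layer over the narrator might look like this (the trigger conditions here are invented; the point is that the framework, not the ever-agreeable model, decides when a choice is fatal):

    # Hard-coded failure events layered over a free-form LLM narrator.
    FAILURE_RULES = [
        (lambda s: "attack" in s.action and s.flags.get("wizard_is_level_99"),
         "You chose to attack the wizard, but he's level 99, dummy, so you died - game over!"),
        (lambda s: s.turns > 200,
         "You wander so long the trail goes cold. The mystery stays unsolved - game over!"),
    ]

    class Session:
        def __init__(self, narrate):
            self.narrate = narrate  # callable: player action -> LLM story text
            self.turns, self.action, self.flags = 0, "", {}

        def step(self, player_action: str) -> str:
            self.turns += 1
            self.action = player_action.lower()
            for rule, ending in FAILURE_RULES:
                if rule(self):     # a hard-coded rule fired:
                    return ending  # end the game regardless of the narrator
            return self.narrate(player_action)  # otherwise the story continues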
https://github.com/derekburgess/dungen
It's a configurable pipeline for generative dungeon master role play content with a Zork-like UI. I use a model called "Wayfarer" which is designed for challenging role play content, and I find that it can be pretty fun to engage with.
Generally speaking, people play games for fun, and I suspect that will continue. Even if an LLM can beat all humans at computer games, it doesn't matter. We will continue to enjoy playing them. Computers, pre-LLM, could already out-play humans in many cases.
Other activities mentioned -- writing, art, coding, etc. -- can indeed be fun, but they are also activities that people have been paid to do. It seems that there is incentive to create LLMs that can do an at least adequate job of these tasks for less money than humans are paid, so that that money is rerouted to LLM companies instead of human workers. I imagine humans will continue to write, create art, and even code, without any financial incentive, though probably less.
(I personally remain unpersuaded that LLMs will do away with paid creative work altogether, but there's clearly a lot of interest in trying to maximize what LLMs can do.)