However, I must point out that the kind of "modern" (relatively speaking) adventure games mentioned in the article -- which are more accurately called "interactive fiction" by the community -- is not very suitable for this kind of experiment. Why? Because so many of them are exploratory/experimental, and not at all about "winning" (unlike, say, "Colossal Cave Adventure", where there is a clear goal).
You cannot automate (via LLM) "playing" them, because they are all about the thoughts and emotions (and maybe shocked laughter) they elicit in human players. This cannot be automated.
If you think I'm being snobby, consider this: the first game TFA mentions is "9:05". Now, you can set goals for a bot to play this game, but truly -- if you've played the game -- you know this would be completely missing the point. You cannot "win" this game, it's all about subverting expectations, and about replaying it once you've seen the first, most straightforward ending, and having a laugh about it.
Saying more will spoil the game :)
(And do note there's no such thing as "spoiling a game" for an LLM, which is precisely the reason they cannot truly "play" these games!)
I'll grant you that 9:05 and For a Change are somewhat more modern: the former has easy puzzles, the latter very abstract puzzles.
I disagree that new text adventures are not about puzzles and winning. They come in all kinds of flavours these days. Even games like 9:05 pace their narrative with traditional puzzles, meaning we can measure forward progress just the same. And to be fair, LLMs are so bad at these games that in these articles I'm merely trying to get them to navigate the world at all.
If anything, I'd argue Adventure is a bad example of the genre you refer to. It was (by design) more of a caving simulator/sandbox with optional loot than a game with progress toward a goal.
The idea was that it'd be a good example of having to navigate somewhat foreign but internally consistent worlds, an essential text adventure skill.
The audience I had in mind when writing it was people who were already quite experienced in playing interactive fiction and could then be challenged in a new way while bringing their old skills to bear. So it's sort of a second-level game in that respect (so is 9:05, in different ways, as someone else mentioned).
I didn't use Adventure as an example of IF; it belongs in the older "text adventure" genre, which is why I thought it would be more fitting for testing LLMs, since it's not about experiences but about maxing points.
I think there's nothing about IF that an LLM can "solve". This genre of games, in its modern expression, is about breaking boundaries and expectations, and making the player enjoy this. Sometimes the fun is simply seeing different endings and how they relate to each other. Since LLMs cannot experience joy or surprise, and can only mechanically navigate the game (maybe "explore all possible end states" is a goal?), they cannot "play" it. Before you object: I'm aware you didn't claim the LLMs are really playing the game!
But here's a test for your set of LLMs: how would they "win" at "Rematch"? This game is about repeatedly dying, understanding what's happening, and stringing together a single sentence that will break the cycle and win the game. Can any LLM do this, a straightforward puzzle? I'd be impressed!
As for the specific question: they would progress at Rematch by figuring out ever more complicated interactions that work, which can then be used to survive.
This seems like begging the question to me.
I don't think there's a mechanistic (as in "token predictor") procedure to generate the emotions of having fun, or being surprised, or amazed. It's not on me to demonstrate it cannot be done, it's on them to demonstrate it can.
But to be clear, I don't think the author of TFA is making this claim either. They are simply approaching IF games from a "problem solving" perspective -- they don't claim this has anything to do with fun or AGI -- and what I'm arguing is that this mechanistic approach to IF games, i.e. "problem solving", only touches on a small subset of what makes people want to play these games. They are often (not all, as the author rightly corrects me, but often) about generating surprise and amazement in the player, something that cannot be done to an LLM.
(Note I'm also not dismissing the author's experiment. As an experiment it's interesting and, I'd argue, fun for the author).
Current state-of-the-art LLMs cannot feel amazement, or anything else really (and, I argue, no LLM in the current tech branch ever will). I hope this isn't a controversial statement.
The purpose of the test is whatever the tester decides it is. If that means finding X% of the ambiguously-good game endings within a budget of Y commands, then so be it.
Well, I did say:
> As an experiment, I cannot argue with this.
It was more a reflection on the fact that the primary goal of a lot of modern IF games, among which is "9:05", the first game mentioned in TFA, is not like "traversing a mountain". Traversing a mountain can have clear and meaningful goals, such as "reach the summit", "avoid getting stuck", or "do not die or go missing after X hours". Though of course, appreciating nature and sightseeing is beyond the scope of an LLM.
Indeed, "9:05" has no other "goal" than, upon seeing a different ending from the main one, revisiting the game with the knowledge gained from that first playthrough. I'm being purposefully opaque in order not to spoil the game for you (you should play it, it's really short).
Let me put it another way: remember that fad, some years ago, of making you pay attention to an image or video, with a prompt like "colorblind people cannot see this shape after X seconds" so you pay attention and then BAM! A jump scare! Haha, joke's on you!
How would you "test" a LLM on such jump scare? The goal is to scare a human. LLMs cannot be scared. What would the possible answers be?
A: I do not see any disappearing shapes after X seconds. Beep boop! I must not be colorblind, nor human, for I am an LLM. Beep!
or maybe
B: This is a well-known joke. Beep boop! After some short time, a monster appears on screen. This is intended to scare the person looking at it! Beep!
Would you say either response would show the LLM "playing" the game?
(Trust me, this is a somewhat adjacent effect to what "9:05" would play on you, and I fear I've said too much!)
And of course, there's no actual reasoning or logic going on, so they cannot compete in this context with a curious 12 year old, either.
If an LLM responds to “You are standing in an open field west of a white house” with “okay, I’m going to walk up to the house”, and just gets back “THAT SENTENCE ISN'T ONE I RECOGNIZE”, it’s not going to make much progress.
It is difficult here to separate out how much of this could be fixed or improved by better prompting. A better baseline might be to just give the LLM direct access to the text adventure, so that everything the LLM replies is given to the game directly. I suspect that the LLMs would do poorly on this task, but would undoubtedly improve over time and generations.
EDIT: Just started playing 9:05 with GPT-4 with no prompting and it did quite poorly; kept trying to explain to me what was going on with the ever more complex errors it would get. Put in a one line "You are playing a text adventure game" and off it went -- it took a shower and got dressed and drove to work.
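For anyone who wants to try the same thing end to end, here's a minimal sketch of that "direct access" setup: run the game through a Z-machine interpreter and feed the LLM's replies straight back in. It assumes dfrotz on the PATH, a local 9:05 story file (the filename is a placeholder), and the OpenAI Python SDK; the ">" prompt detection is a crude heuristic, not anything robust.

    import subprocess
    from openai import OpenAI  # assumes the official OpenAI Python SDK

    client = OpenAI()

    # Launch a Z-machine interpreter on the story file (filename is a placeholder).
    game = subprocess.Popen(
        ["dfrotz", "905.z5"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, bufsize=1,
    )

    def read_until_prompt():
        # Crude heuristic: read until the interpreter prints its ">" prompt.
        out = []
        while True:
            ch = game.stdout.read(1)
            if not ch:
                break
            out.append(ch)
            if "".join(out).rstrip().endswith(">"):
                break
        return "".join(out)

    history = [{"role": "system",
                "content": "You are playing a text adventure game. "
                           "Reply with exactly one game command and nothing else."}]

    for _ in range(50):  # command budget
        history.append({"role": "user", "content": read_until_prompt()})
        reply = client.chat.completions.create(model="gpt-4", messages=history)
        command = reply.choices[0].message.content.strip()
        history.append({"role": "assistant", "content": command})
        game.stdin.write(command + "\n")  # everything the LLM says goes to the game
        game.stdin.flush()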
[0] "Thinking, Fast and Slow" https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow
This is exactly why something like ARC-AGI-3 feels so important right now. Instead of static benchmarks that these models can basically brute-force with enough training data, it's designed around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions. That shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.
What's clever about the game-environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely. When humans encounter a new game, we explore, form plans, remember what worked, and adjust our strategy: all the interactive reasoning over time that these text adventure results show LLMs are terrible at. We need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.
It's interesting he wrote a separate program to wrap the z-machine interpreter. I integrated my wrapper directly into my pytorch training program.
Otherwise, how can you determine when "north" is a context change? It isn't always one: sometimes the move puts you in a new room, and sometimes you just walk into a wall.
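I don't know what your wrapper does, but as a sketch of the kind of bookkeeping involved: count "north" as a context change only when the game actually reports a new location, e.g. by spotting the room-name header most Z-machine games print. The regex here is a rough heuristic that would need per-game tuning.

    import re

    # Heuristic: many Z-machine games print the room name as a short
    # title-case line by itself when you enter a new location.
    ROOM_HEADER = re.compile(r"^[A-Z][A-Za-z' -]{2,40}$")

    class ContextTracker:
        def __init__(self):
            self.current_room = None

        def moved(self, game_output: str) -> bool:
            """Return True only if this output indicates a new room,
            so 'north' into a wall is not counted as a context change."""
            for line in game_output.splitlines():
                line = line.strip()
                if ROOM_HEADER.match(line):
                    if line != self.current_room:
                        self.current_room = line
                        return True
                    return False
            return False  # no room header: the move was rejected or a no-op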
Part of what distinguishes humans from artificial "intelligence" to me is exactly that we automatically develop models of whatever is needed.
> real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out
It is insane to me that so many people believe intelligence is measurable by pure question-answer testing. There's hundreds of years of discussion about how limited this is even for measuring human intelligence. I'm sure we all know someone who's a really good test taker but who you wouldn't consider especially bright, and I'm sure every one of us also knows someone in the other camp (bad at tests but clearly bright). The definition you put down is much closer to what's agreed upon in the scientific literature. While we don't have a good formal definition of intelligence, there's a difference between lacking a formal definition and having no definition at all. I really do hope people read more about intelligence and how we measure it in humans and animals. It's very messy and there's a lot of noise, but at least we have a good idea of the directions to move in. There are still nuances to be learned, and while I think ARC is an important test, I don't think success on it will prove AGI (and Chollet says this too).
I find it funny that some AIs score very well on ARC-AGI but fail at these games...
So from this it seems that not only would many of these requests not touch a reasoning model (or as it works now, have reasoning set to "minimal"?), but they're probably being routed to a mini or nano model?
It would make more sense, I think, to test on gpt-5 itself (and ideally the -mini and -nano as well), and perhaps with different reasoning effort, because that makes a big difference in many evals.
EDIT: Yeah the Chat router is busted big time. It fails to apply thinking even for problems that obviously call for it (analyzing financial reports). You have to add "Think hard." to the end of the prompt, or explicitly switch to the Thinking model in the UI.
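For what it's worth, pinning the model and reasoning effort explicitly avoids the router entirely. A sketch, assuming the Responses API's reasoning-effort parameter as currently documented for GPT-5 (the model names and the toy transcript are placeholders):

    from openai import OpenAI

    client = OpenAI()
    transcript = ("West of House\n"
                  "You are standing in an open field west of a white house.")

    # Sweep models and reasoning budgets over the same game state.
    for model in ("gpt-5", "gpt-5-mini", "gpt-5-nano"):
        for effort in ("minimal", "low", "medium", "high"):
            resp = client.responses.create(
                model=model,
                reasoning={"effort": effort},  # bypasses any chat-side routing
                input="You are playing a text adventure game.\n\n" + transcript,
            )
            print(model, effort, "->", resp.output_text)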
I don't really get this gripe? It seems no different than before, except now it will sometimes opt into thinking harder by itself. If you know you want CoT reasoning you just select gpt5-thinking, no different than choosing o4-mini/o3 like before.
I've found for text adventures based on item manipulation, variations of the same puzzles appear again and again because there's a limit to how many obscure but not too obscure item puzzles you can come up with, so training would be good for exact matches of the same puzzle, and variations, like different ways of opening locked doors.
Puzzles like key + door, crowbar + panel, dog + food, coin + vending machine, vampire + garlic, etc. You can obscure or layer puzzles, like changing the garlic into garlic bread, which would still work on the vampire, so there are logical connections to make but often nothing too crazy.
A lot of the difficulty in these games comes from not noticing or forgetting about clues/hints and potential puzzles because there's so much going on, which is less likely to trip up a computer.
You can already ask LLMs "in a game: 20 ways to open a door if I don't have the key", "how to get past an angry guard dog" or "I'm carrying X, Y, and Z, how do I open a door", and it'll list lots of ways that are seen in games, so it's going to be good at matching that with the current list of objects you're carrying, items in the world, and so on.
Another comment mentions how the AI needs a world model that transforms as actions are performed, but you need something similar to reason about maths proofs and code, where you have to keep track of the current state/context. And most adventure games don't require you to plan many steps in advance anyway. They're often about figuring out which item to combine/use with which other item next (where only one combination works), and navigating to the room that contains the latter item first.
So it feels like most of the parts are already there to me, and it's more about getting the right prompts and presenting the world in the right format, e.g. maintaining a table of items, clues, and open puzzles to look for connections and matches, and maintaining a map.
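Something like this, as a rough sketch (the field names and prompt format are my own invention, not anything the article uses):

    from dataclasses import dataclass, field

    @dataclass
    class WorldState:
        """Explicit bookkeeping fed back into the prompt every turn."""
        inventory: set[str] = field(default_factory=set)
        items_seen: dict[str, str] = field(default_factory=dict)   # item -> room it was seen in
        open_puzzles: list[str] = field(default_factory=list)      # e.g. "locked door in the hall"
        game_map: dict[str, dict[str, str]] = field(default_factory=dict)  # room -> {direction: room}

        def as_prompt(self) -> str:
            return (
                f"Inventory: {sorted(self.inventory)}\n"
                f"Items seen elsewhere: {self.items_seen}\n"
                f"Unsolved puzzles: {self.open_puzzles}\n"
                f"Map so far: {self.game_map}\n"
                "Suggest one command, matching carried items against the open puzzles."
            )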
Getting LLMs to get good at variations of The Witness would be interesting, where the rules have to be learned through trial and error, and combined.
They just can't seem to grasp what would make a choice a "wrong" choice in a text-based adventure game, so they end up having no ending. You have to hard-code failure events, or you just never get anything like "you chose to attack the wizard, but he's level 99, dummy, so you died - game over!". It just accepts whatever choice you make, ad infinitum.
My best session was one in which I had the AI give me 4 dialogue options to choose from. I never "beat" the game, and we never solved the mystery - it just kept going further down the rabbit hole. But it was surprisingly enjoyable, and replayable! A larger framework just needs to be written for it to keep the tires between the lines and to hard-code certain game rules - what's under the hood is already quite good for narratives imo.
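For example, a thin rules layer over the narrator might look like this (the trigger conditions here are invented; the point is that the framework, not the ever-agreeable model, decides when a choice is fatal):

    # Hard-coded failure events layered over a free-form LLM narrator.
    FAILURE_RULES = [
        (lambda s: "attack" in s.action and s.flags.get("wizard_is_level_99"),
         "You chose to attack the wizard, but he's level 99, dummy, so you died - game over!"),
        (lambda s: s.turns > 200,
         "You wander so long the trail goes cold. The mystery stays unsolved - game over!"),
    ]

    class Session:
        def __init__(self, narrate):
            self.narrate = narrate  # callable: player action -> LLM story text
            self.turns, self.action, self.flags = 0, "", {}

        def step(self, player_action: str) -> str:
            self.turns += 1
            self.action = player_action.lower()
            for rule, ending in FAILURE_RULES:
                if rule(self):     # a hard-coded rule fired:
                    return ending  # end the game regardless of the narrator
            return self.narrate(player_action)  # otherwise the story continues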
https://github.com/derekburgess/dungen
It's a configurable pipeline for generative dungeon master role play content with a Zork-like UI. I use a model called "Wayfarer" which is designed for challenging role play content, and I find that it can be pretty fun to engage with.
Generally speaking, people play games for fun, and I suspect that will continue. Even if an LLM can beat all humans at computer games, it doesn't matter. We will continue to enjoy playing them. Computers, pre-LLM, could already out-play humans in many cases.
Other activities mentioned -- writing, art, coding, etc. -- can indeed be fun, but they are also activities that people have been paid to do. It seems that there is incentive to create LLMs that can do an at least adequate job of these tasks for less money than humans are paid, so that that money is rerouted to LLM companies instead of human workers. I imagine humans will continue to write, create art, and even code, without any financial incentive, though probably less.
(I personally remain unpersuaded that LLMs will do away with paid creative work altogether, but there's clearly a lot of interest in trying to maximize what LLMs can do.)