Baba Is Eval

https://fi-le.net/baba/
131•fi-le•1d ago

Comments

kinduff•4h ago
Do you think the performance can be improved if the representation of the level is different?

I've seen AI struggle with ASCII, but when presented as other data structures, it performs better.

edit:

e.g. JSON with structured coordinates, graph based JSON, or a semantic representation with the coordinates
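To make the suggestion concrete, here is a minimal sketch of turning a text grid into JSON with explicit coordinates. The legend characters, the field names, and the `grid_to_json` helper are all assumptions for illustration, not the format used in the article.

```python
import json

# Hypothetical legend: which character stands for which object.
LEGEND = {"#": "wall", "B": "baba", "F": "flag", "R": "rock"}


def grid_to_json(rows, legend=LEGEND):
    """Convert a rectangular text grid into JSON with one entry per object."""
    entities = [
        {"type": legend[ch], "x": x, "y": y}
        for y, row in enumerate(rows)      # y grows downward, row by row
        for x, ch in enumerate(row)        # x grows to the right
        if ch in legend                    # anything else counts as empty floor
    ]
    return json.dumps(
        {"width": len(rows[0]), "height": len(rows), "entities": entities}
    )


if __name__ == "__main__":
    print(grid_to_json(["B.R", "..F"]))
```

Each object carries its own coordinates, so the model no longer has to count characters to work out what sits above or below what.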

hajile•3h ago
If it struggles with the representation, that makes it an even better test of the AI's thinking potential.
RainyDayTmrw•3h ago
In the limit case, to an actual general intelligence, representation is superfluous, because it can figure out how to convert freely.

To the extent that the current generation of AI isn't general, yeah, papering over some of its weaknesses may allow you to expose other parts of it, both strengths and other weaknesses.

kadoban•9m ago
A human can easily struggle at solving a poorly communicated puzzle, especially if paper/pencil or something isn't available to convert to a better format. LLMs can look back at what they wrote, but it seems kind of like a poor format for working out a better representation to me.
k2xl•4h ago
Baba Is You is a great game, and part of a wider collection of 2D grid puzzle games.

(Shameless plug: I am one of the developers of Thinky.gg (https://thinky.gg), which is a thinky puzzle game site for a 'shortest path style' game [Pathology] and a Sokoban variant [Sokopath].)

These games are typically NP-hard, so solvers for Sokoban (or Pathology) have mostly relied on brute-force search combined with various heuristics and optimizations (like BFS, deadlock detection, and Zobrist hashing). However, once levels get beyond a certain size with enough movable blocks, you end up exhausting memory pretty quickly.
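For readers who haven't seen one, a brute-force solver of this kind can be sketched as a BFS over (player position, box positions) states. This is a generic Sokoban-style sketch under assumed rules and an assumed level encoding, not the actual Thinky.gg solver. It relies on Python's built-in hashing of tuples and frozensets instead of Zobrist hashing, and it does no deadlock detection.

```python
from collections import deque

# Hypothetical encoding: '#' wall, '.' floor, '@' player, '$' box, 'G' goal.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def parse(rows):
    """Split a rectangular level into walls, boxes, goals and player start."""
    walls, boxes, goals, player = set(), set(), set(), None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == "$":
                boxes.add((r, c))
            elif ch == "G":
                goals.add((r, c))
            elif ch == "@":
                player = (r, c)
    return walls, frozenset(boxes), goals, player


def solve(rows):
    """Return a shortest move string that covers every goal, or None."""
    walls, boxes, goals, player = parse(rows)
    height, width = len(rows), len(rows[0])

    def free(cell):
        r, c = cell
        return 0 <= r < height and 0 <= c < width and cell not in walls

    start = (player, boxes)
    seen = {start}                      # this set is what eats the memory
    queue = deque([(start, "")])
    while queue:
        (player, boxes), path = queue.popleft()
        if goals <= boxes:              # every goal has a box on it
            return path
        for name, (dr, dc) in MOVES.items():
            nxt = (player[0] + dr, player[1] + dc)
            if not free(nxt):
                continue
            new_boxes = boxes
            if nxt in boxes:            # walking into a box pushes it
                beyond = (nxt[0] + dr, nxt[1] + dc)
                if not free(beyond) or beyond in boxes:
                    continue
                new_boxes = (boxes - {nxt}) | {beyond}
            state = (nxt, new_boxes)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + name))
    return None


if __name__ == "__main__":
    print(solve(["#####", "#@$G#", "#####"]))
```

The `seen` set holds every distinct combination of player and box positions reached so far, which is why memory runs out as the number of movable blocks grows.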

These types of games are still "AI Proof" so far, in that LLMs are absolutely awful at solving them while humans are very good (so they seem reasonable to consider for ARC-AGI benchmarks). Whenever a new reasoning model gets released I typically try it on some basic Pathology levels (like 'One at a Time' https://pathology.thinky.gg/level/ybbun/one-at-a-time) and they fail miserably.

Simple level code for the above level (1 is a wall, 2 is a movable block, 4 is starting block, 3 is the exit):

000
020
023
041

Similar to OP, I've found that Claude can't manage rule dynamics, blocked paths, or game objectives well, and spits out random results.

kinduff•4h ago
On page 3 of the Factorio paper [1], the agent receives a semantic representation with coordinates. Have you tried this data format?

[1]: https://arxiv.org/pdf/2503.09617

ekianjo•4h ago
This is definitely a case for fine-tuning an LLM on this game's data. There is currently no LLM out there that can play many different kinds of games very well.
captn3m0•3h ago
I once made a “RC plays Baba Is You” that controlled the game over a single shared browser that was streaming video and controls back to the game. Was quite fun!

But I am fairly sure all of Baba Is You solutions are present in the training data for modern LLMs so it won’t make for a good eval.

chmod775•15m ago
> But I am fairly sure all of Baba Is You solutions are present in the training data for modern LLMs so it won’t make for a good eval.

Claude 4 cannot solve any Baba Is You level (except level 0 that is solved by 8 right inputs), so for now it's at least a nice low bar to shoot for...

RainyDayTmrw•2h ago
This is interesting. If you approach this game as individual moves, the search tree is really deep. However, most levels can be expressed as a few intermediate goals.

In some ways, this reminds me of the history of AI Go (board game). But the resolution there was MCTS, which wasn't at all what we wanted (insofar as MCTS is not generalizable to most things).

rtpg•2h ago
> However, most levels can be expressed as a few intermediate goals

I think generally the whole thing with puzzle games is that you have to determine the “right” intermediate goals. In fact, the naive intermediate goals are often entirely wrong!

A canonical sokoban-like inversion might be where you have to push two blocks into goal areas. You might think “ok, push one block into its goal area and then push another into it.”

But many of these games will have mechanisms meaning you would first want to push one block into its goal, then undo that for some reason (it might activate some extra functionality), push the other block, and then finally go back and do the thing.

There’s always weird tricks that mean that you’re going to walk backwards before walking forwards. I don’t think it’s impossible for these things to stumble into it, though. Just might spin a lot of cycles to get there (humans do too I guess)

matsemann•1h ago
Yeah, often working backwards and forwards at the same time is how to solve some advanced puzzle games. Then you keep it from exploding in options. When thinking backwards from the goal, you figure out constraints or "invariants" the forward path must uphold, thus can discard lots of dead ends earlier in your forward path.

To me, those discoveries are the fun part of most puzzle games. When you unlock the "trick" for each level and the dopamine flies, heh.

kadoban•14m ago
> But the resolution there was MCTS

MCTS wasn't _really_ the solution to go. MCTS-based AIs existed for years and they weren't _that_ good. They weren't superhuman for sure, and the moves/games they played were kind of boring.

The key to doing go well was doing something that vaguely looks like MCTS but the real guts are a network that can answer: "who's winning?" and "what are good moves to try here?" and using that to guide search. Additionally essential was realizing that computation (run search for a while) with a bad model could be effectively+efficiently used to generate better training data to train a better model.
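One common way to let a network's "good moves to try" and "who's winning?" answers steer the search is a PUCT-style selection rule, as used in AlphaZero-style programs. The sketch below shows only that selection step. The dictionary layout, the function names, and the constant `c_puct=1.5` are assumptions for illustration.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Value estimate plus an exploration bonus weighted by the policy prior."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + bonus


def select_move(children, c_puct=1.5):
    """Pick the child move with the highest PUCT score.

    `children` maps a move to a dict with:
      'q'      mean value from search so far (from the "who's winning?" head)
      'prior'  policy probability (from the "good moves to try" head)
      'visits' how often search has already tried this move
    """
    parent_visits = sum(ch["visits"] for ch in children.values())
    return max(
        children,
        key=lambda m: puct_score(
            children[m]["q"],
            children[m]["prior"],
            parent_visits,
            children[m]["visits"],
            c_puct,
        ),
    )


if __name__ == "__main__":
    children = {
        "a": {"q": 0.0, "prior": 0.9, "visits": 0},
        "b": {"q": 0.1, "prior": 0.1, "visits": 10},
    }
    print(select_move(children))
```

Moves the network likes but search has barely tried get a large bonus, so the search spends its budget where the model points instead of on uniformly random playouts.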

pclmulqdq•1h ago
I have noticed a trend of the word "Desiderata" appearing in a lot more writing. Is this an LLM word or is it just in fashion? Most people would use the words "Desires" or "Goals," so I assume this might be the new "delve."
Tomte•17m ago
It's academic jargon. Desiderata are often at the end of a paper, in the section "someone should investigate X, but I'm moving on to the next funded project".
wohoef•27m ago
In my experience LLMs have a hard time working with text grids like this. They seem to find columns harder to "detect" than rows, probably because their input presents the grid as one giant row, if that makes sense.

They have the same problem with playing chess. But I'm not sure if there is a data type they could work with for this kind of game. Currently it seems more like LLMs can't really work on spatial problems. But this should actually be something that can be fixed (pretty sure I saw an article about it on HN recently)
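A small sketch of the "giant row" point: once a grid is joined into one string, horizontal neighbours are 1 character apart, but vertical neighbours are a full row width plus a newline apart. The example grid and helper names are made up, and real models see tokens, not characters, so the exact distances will differ.

```python
def flat_index(row, col, width):
    """Character offset of cell (row, col) in a newline-joined grid."""
    return row * (width + 1) + col     # +1 for the newline ending each row


def vertical_gap(width):
    """Characters between a cell and the cell directly below it."""
    return flat_index(1, 0, width) - flat_index(0, 0, width)


if __name__ == "__main__":
    grid = ["#####", "#.B.#", "#.F.#", "#####"]   # hypothetical level
    text = "\n".join(grid)
    width = len(grid[0])
    b = flat_index(1, 2, width)        # 'B' in row 1
    f = flat_index(2, 2, width)        # 'F' directly below it
    print(text[b], text[f], f - b)
```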

tibastral2•3m ago
It reminds me of https://en.m.wikipedia.org/wiki/The_Ricks_Must_Be_Crazy. Hope we are not ourselves in some sort of simulation ;)
