Watching 03 Model Sweat over a Paul Morphy Mate-in-2

https://alexop.dev/posts/how-03-model-tries-chess-puzzle/

51•alexop•9h ago

Comments

awestroke•8h ago

O3 is massively underwhelming and is obviously tuned to be sycophantic.

Claude reigns supreme.

omneity•8h ago

This somehow reminds me of Agent-3 from [0].

0: https://ai-2027.com

tomduncalf•7h ago

Depends on the task, I think. O3 is really effective at going off and doing research: try giving it a complex task that involves lots of browsing/searching and watch how it behaves. Claude cannot do anything like that right now. I do find O3’s tone of voice a bit odd.

tough•8h ago

I've committed the 03 (zero-three) instead of o3 (o-three) typo too, but can we fix it in the title, please?

sMarsIntruder•8h ago

So we're talking about OpenAI's o3 model, right?

alexop•8h ago

yes

bcraven•8h ago

>"When I gave OpenAI’s 03 model a tough chess puzzle..."

Opening sentence

monktastic1•7h ago

A little annoying that they use zero instead of o, but yeah.

janaagaard•7h ago

I was also confused. It looks like the article has been corrected, and now uses the familiar 'o3' name.

freediver•8h ago

On a similar note, I just updated LLM Chess Puzzles repo [1] yesterday.

The fact that gpt-4.5 solves 85% correctly is unexpected and somewhat scary (if the model was not trained on this).

[1] https://github.com/kagisearch/llm-chess-puzzles

alexop•8h ago

Oh cool, I wonder how good o3 will be. While using o3, I noticed something funny: sometimes I gave it a screenshot without any position data. It ended up using Python and spent 10 minutes just trying to figure out exactly where the pieces were.

Gimpei•6h ago

Given that o3 is trained on the contents of the Internet, and the answers to all these chess problems are almost certainly on the Internet in multiple places, in a sense it has been weakly trained on this content. The question for me becomes: is the LLM doing better on these problems because it’s improving in reasoning, or is it simply improving in information retrieval?

ttoinou•7h ago

Where does this obsession with giving binary logic tasks to LLMs come from? New LLM breakthroughs are about handling blurry logic and imprecise requirements, and producing vague, human-realistic outputs. Who cares how well it can add integers or solve chess puzzles? We have decades of computer science on those topics already.

Arainach•7h ago

If we're going to call LLMs intelligent, they should be performant at these tasks as well.

ttoinou•6h ago

We called our computers intelligent when they couldn't do so many of the things LLMs can now do easily.

But yeah, calling them intelligent is a very effective marketing trick.

tgtweak•7h ago

I remember reading that gpt-3.5-turbo-instruct was oddly good at chess. I would be curious what it outputs as the next two moves here.

Kapura•7h ago

So... it failed to solve the puzzle? That seems distinctly unimpressive, especially for a puzzle with a fixed start state and a limited set of possible moves.

IanCal•6h ago

> That seems distinctly unimpressive

I cannot overstate how impressive this is to me, having been involved in AI research projects and robotics in years gone by.

This is a general purpose model, given an image and human written request that then step by step analyses the image, iterates through various options, tries to write code to solve the problem and then searches the internet for help. It reads multiple results and finds an answer, checks to validate it and then comes back to the user.

I had a robot that took ages to learn to play tic tac toe by example, and if the robot moved originally there was a solid chance it thought the entire world had changed and would freak out because it thought it might punch through the table.

This is also a chess puzzle marked as very hard that a person who is good at chess should give themselves fifteen minutes to solve. The author of the chess.com blog containing this puzzle only solved about half of them!

This is not an image analysis bot, it's not a chess bot, it's a general system I can throw bad English at.

alexop•6h ago

Yes, I agree. Like I said, in the end it did what a human would do: google for the answer. Still, it was interesting to see how the reasoning unfolded. Normally, humans train on these kinds of puzzles until they become pure pattern recognition. That's why you can't become a grandmaster if you only start learning chess as an adult — you need to be a kid and see thousands of these problems early on, until recognizing them becomes second nature. It's something humans are naturally very good at.

kamranjon•4h ago

I am a human and I figured this puzzle out in under a minute by just trying the small set of possible moves until I got it correct. I am not a serious chess player. I would have expected it to at least try the possible moves? I think this maybe lends credence to the idea that these models aren’t actually reasoning but are doing a great job of mimicking what we think humans do.

Kapura•5h ago

I am sorry, but if this impresses you, you are a rube. If this were a machine with the smallest bit of actual intelligence it would, upon seeing it's a chess puzzle, remember "hey, i am a COMPUTER and a small set of fixed moves should take me about 300ms or so to fully solve out" and then do that. If the machine _literally has to cheat to solve the puzzle_ then we have made technology that is, in fact, less capable than what we created in the past.

"Well, it's not a chess engine so it's impressive it-" No. Stop. At best what we have here is an extremely computationally expensive way to just google a problem. We've been googling things since I was literally a child. We've had voice search with google for, idk, a decade+. A computer that can't even solve its own chess problems is an expensive regression.

mhh__•5h ago

If you mean write code to exhaustively search the solution space, then they can actually do that quite happily, provided you tell them you will execute the code for them.

bobsmooth•5h ago

A computer program that has the agency to google a problem, interpret the results, and respond to a human was science fiction just 10 years ago. The entire field of natural language processing has been solved and it's insane.

dimatura•1h ago

Honestly, I think that if in 2020 you had asked me whether we would be able to do this in 2025, I would've guessed no, with a fairly high confidence. And I was aware of GPT back then.

jncfhnb•4h ago

Looks to me like it would have simulated the steps using sensible tools but didn’t know it was sandboxed out of using those tools? I think that’s pretty reasonable.

Suppose we removed its ability to google and it conceded to doing the tedium of writing a chess engine to simulate the steps. Is that “better” for you?

currymj•4h ago

> "hey, i am a COMPUTER and a small set of fixed moves should take me about 300ms or so to fully solve out"

from the article:

"3. Attempt to Use Python When pure reasoning was not enough, o3 tried programming its way out of the situation.

“I should probably check using something like a chess engine to confirm.” (tries to import chess module, but fails: “ModuleNotFoundError”).

It wanted to run a simulation, but of course, it had no real chess engine installed."

this strategy failed, but if OpenAI were to add "pip install python-chess" to the environment, it very well might have worked. in any case, the machine did exactly the thing you claim it should have done.

possibly scrolling down to read the full article makes you a rube though.

andoando•5h ago

I'm a 1600-rated player and this took me 20 seconds to solve. Is this really considered a very hard puzzle?

The obvious moves don't work. You can see that White's pawn moving forward is mate, and you can see Black is essentially trapped and has very limited moves, so I immediately thought the first move is a waiting move, and there are only two options there. Block the black pawn from moving, and if the bishop moves, rook takes is mate. So the rook has to block, and you can see the bishop either moves or the pawn captures, and then the pawn moving forward is mate.

bubblyworld•4h ago

Agreed, I'm at a similar FIDE level (not rated, but ~2k on Lichess) and it took me a few seconds as well. Not a hard puzzle, for a regular chess player anyway.

BXLE_1-1-BitIs1•5h ago

Nice puzzle with a twist of Zugzwang. Took me about 8 minutes, but it's been decades since I was doing chess.

bfung•5h ago

LLMs are not chess engines, similar to how they don’t really calculate arithmetic. What’s new? Carry on.

foundry27•1h ago

I just tried the same puzzle in o3 using the same image input, but tweaked the prompt to say “don’t use the search tool”. Very similar results!

It spent the first few minutes analyzing the image and cross-checking various slices of the image to make sure it understood the problem. Then it spent the next 6-7 minutes trying to work through various angles to the problem analytically. It decided this was likely a mate-in-two (part of the training data?), but went down the path that the key to solving the problem would be to convert the position to something more easily solvable first. At that point it started trying to pip install all sorts of chess-related packages, and when it couldn’t get that to work it started writing a simple chess solver in Python by hand (which didn’t work either). At one point it thought the script had found a mate-in-six that turned out to be due to a script bug, but I found it impressive that it didn’t just trust the script’s output - instead it analyzed the proposed solution and determined the nature of the bug in the script that caused it. Then it gave up and tried analyzing a bit more for five more minutes, at which point the thinking got cut off and displayed an internal error.

15 minutes total, didn’t solve the problem, but fascinating! There were several points where if the model were more “intelligent”, I absolutely could see it reasoning it out following the same steps.
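For a sense of scale, a hand-written exhaustive search of the kind discussed above fits in well under a hundred lines of Python. The sketch below is only an illustration, not what o3 produced: the FEN string is an assumed transcription of the commonly published Morphy position (it is not quoted anywhere in the thread), all function names are made up for this example, and the move generator deliberately ignores castling, en passant, and underpromotion, none of which matter in this position.

```python
# Minimal mate-in-two search. Assumptions: no castling, en passant, or
# underpromotion (pawns always promote to a queen).
ROOK = [(1, 0), (-1, 0), (0, 1), (0, -1)]
BISHOP = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
KNIGHT = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
# piece letter -> (directions, maximum number of steps in each direction)
MOVES = {"K": (ROOK + BISHOP, 1), "Q": (ROOK + BISHOP, 7),
         "R": (ROOK, 7), "B": (BISHOP, 7), "N": (KNIGHT, 1)}

# Assumed transcription of the puzzle: White Kc8, Ra1, Pb6; Black Ka8, Bb8, Pa7, Pb7.
MORPHY_FEN = "kbK5/pp6/1P6/8/8/8/8/R7 w - - 0 1"

def parse_fen(fen):
    """Return {(file, rank): piece} from the placement field of a FEN string."""
    board = {}
    for row_index, row in enumerate(fen.split()[0].split("/")):
        file = 0
        for ch in row:
            if ch.isdigit():
                file += int(ch)
            else:
                board[(file, 7 - row_index)] = ch
                file += 1
    return board

def pseudo_moves(board, white):
    """Yield (src, dst) moves that ignore whether the mover's king is left in check."""
    for (f, r), piece in list(board.items()):
        if piece.isupper() != white:
            continue
        if piece.upper() == "P":
            step, start = (1, 1) if white else (-1, 6)
            if 0 <= r + step <= 7 and (f, r + step) not in board:
                yield (f, r), (f, r + step)
                if r == start and (f, r + 2 * step) not in board:
                    yield (f, r), (f, r + 2 * step)
            for df in (-1, 1):
                dst = (f + df, r + step)
                if dst in board and board[dst].isupper() != white:
                    yield (f, r), dst
            continue
        directions, reach = MOVES[piece.upper()]
        for df, dr in directions:
            for n in range(1, reach + 1):
                dst = (f + df * n, r + dr * n)
                if not (0 <= dst[0] <= 7 and 0 <= dst[1] <= 7):
                    break
                if dst in board:
                    if board[dst].isupper() != white:
                        yield (f, r), dst  # capture, then stop sliding
                    break
                yield (f, r), dst

def apply_move(board, move):
    """Return a new board with the move played (auto-queen on the last rank)."""
    src, dst = move
    new = dict(board)
    piece = new.pop(src)
    if piece.upper() == "P" and dst[1] in (0, 7):
        piece = "Q" if piece.isupper() else "q"
    new[dst] = piece
    return new

def in_check(board, white):
    king = next(sq for sq, p in board.items() if p == ("K" if white else "k"))
    return any(dst == king for _, dst in pseudo_moves(board, not white))

def legal_moves(board, white):
    return [m for m in pseudo_moves(board, white)
            if not in_check(apply_move(board, m), white)]

def is_mate(board, white):
    return in_check(board, white) and not legal_moves(board, white)

def has_mating_move(board, white):
    return any(is_mate(apply_move(board, m), not white)
               for m in legal_moves(board, white))

def mate_in_two(board, white=True):
    """Return every first move after which each legal reply allows mate in one."""
    keys = []
    for key in legal_moves(board, white):
        after = apply_move(board, key)
        replies = legal_moves(after, not white)
        if replies and all(has_mating_move(apply_move(after, r), white)
                           for r in replies):
            keys.append(key)
    return keys

def square_name(square):
    return "abcdefgh"[square[0]] + str(square[1] + 1)

if __name__ == "__main__":
    for src, dst in mate_in_two(parse_fen(MORPHY_FEN)):
        print(square_name(src) + square_name(dst))  # prints "a1a6"
```

On the assumed position this reports the rook move from a1 to a6 as the only key, which matches the waiting-move solution described in the comments above. With a library such as python-chess available, the hand-rolled move generation could presumably be swapped for its legal move generator, leaving only the two nested loops.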
