Opening sentence
The fact that gpt-4.5 gets 85% correctly solved is unexpected and somewhat scary (if model was not trained on this).
But yeah calling them intelligent is a marketing trick that is very efficient
I cannot understate how impressive this is to me, having been involved in ai research projects and robotics in years gone by.
This is a general purpose model, given an image and human written request that then step by step analyses the image, iterates through various options, tries to write code to solve the problem and then searches the internet for help. It reads multiple results and finds an answer, checks to validate it and then comes back to the user.
I had a robot that took ages to learn to plan tic tac toe by example and if the robot moved originally there was a solid chance it thought the entire world had changed and would freak out because it thought it might punch through the table.
This is also a chess puzzle marked as very hard that a person who is good at chess should give themselves fifteen minutes to solve. The author of the chess.com blog containing this puzzle only solved about half of them!
This is not an image analysis bot, it's not a chess bot, it's a general system I can throw bad english at.
"Well, it's not a chess engine so its impressive it-" No. Stop. At best what we have here is an extremely computationally expensive way to just google a problem. We've been googling things since I was literally a child. We've had voice search with google for, idk, a decade+. A computer that can't even solve its own chess problems is an expensive regression.
Suppose we removed its ability to google and it conceded to doing the tedium of writing a chess engine to simulate the steps. Is that “better” for you?
from the article:
"3. Attempt to Use Python When pure reasoning was not enough, o3 tried programming its way out of the situation.
“I should probably check using something like a chess engine to confirm.” (tries to import chess module, but fails: “ModuleNotFoundError”).
It wanted to run a simulation, but of course, it had no real chess engine installed."
this strategy failed, but if OpenAI were to add "pip install python-chess" to the environment, it very well might have worked. in any case, the machine did exactly the thing you claim it should have done.
possibly scrolling down to read the full article makes you a rube though.
The obvious moves dont work, you can see whites pawn moving forward is mate, and you can see black is essentially trapped and has very limited moves, so immediately I thought first move is a waiting move and theres only two options there. Block the black pawn moving and if bishop moves, rook takes is mate. So rook has to block, and you can see bishop either moves or captures and pawn moving forward is mate
It spent the first few minutes analyzing the image and cross-checking various slices of the image to make sure it understood the problem. Then it spent the next 6-7 minutes trying to work through various angles to the problem analytically. It decided this was likely a mate-in-two (part of the training data?), but went down the path that the key to solving the problem would be to convert the position to something more easily solvable first. At that point it started trying to pip install all sorts of chess-related packages, and when it couldn’t get that to work it started writing a simple chess solver in Python by hand (which didn’t work either). At one point it thought the script had found a mate-in-six that turned out to be due to a script bug, but I found it impressive that it didn’t just trust the script’s output - instead it analyzed the proposed solution and determined the nature of the bug in the script that caused it. Then it gave up and tried analyzing a bit more for five more minutes, at which point the thinking got cut off and displayed an internal error.
15 minutes total, didn’t solve the problem, but fascinating! There were several points where if the model were more “intelligent”, I absolutely could see it reasoning it out following the same steps.
awestroke•8h ago
Claude reigns supreme.
omneity•8h ago
0: https://ai-2027.com
tomduncalf•7h ago