The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from GPT-5.2-Codex's 64.7.
GPT-5.3-codex scores 77.3.
That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.
You can take off your tinfoil hat. The same models can perform differently depending on the programming language, the frameworks and libraries employed, and even the project. Context also matters, and a model's output varies greatly depending on your prompt history.
AI agents, perhaps? :-D
Every new model overfits to the latest overhyped benchmark.
Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.
It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
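If you want to script that hand-off rather than do it by hand, here's a rough sketch. It assumes the codex and claude CLIs are installed and authenticated, and that `codex exec` (non-interactive run) and `claude -p` (print mode) behave as documented; the task prompt is made up.

```python
# Sketch of the "one frontier model implements, another reviews" workflow.
# Assumes the codex/claude CLIs are installed and logged in; flag behavior
# can vary across CLI versions, so treat this as illustrative.
import subprocess

TASK = "Add retry-with-backoff to the HTTP client in src/client.py"  # hypothetical task

# Step 1: let Codex run the task non-interactively.
subprocess.run(["codex", "exec", TASK], check=True)

# Step 2: hand the resulting diff to the second model for review.
diff = subprocess.run(["git", "diff"], capture_output=True, text=True, check=True).stdout
review = subprocess.run(
    ["claude", "-p", "Review this diff and flag any potential issues:\n" + diff],
    capture_output=True, text=True, check=True,
).stdout
print(review)
```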
Looking forward to trying 5.3.
Hopefully performance will pick up after the rollout.
Cost to Run Artificial Analysis Intelligence Index:
GPT-5.2 Codex (xhigh): $3244
Claude Opus 4.5-reasoning: $1485
(and probably similar values for the newer models?)
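(For scale, assuming the index is measured comparably across the two: $3244 / $1485 ≈ 2.2, so the Codex run costs a bit over twice as much.)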
Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.
But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.
Solution: use both as needed!
I really do wonder what's the chain here. Did Sam see the Opus announcement and DM someone a minute later?
GPT-4o vs. Google I/O (May 2024): OpenAI scheduled its "Spring Update" exactly 24 hours before Google’s biggest event of the year, Google I/O. They launched GPT-4o voice mode.
Sora vs. Gemini 1.5 Pro (Feb 2024): Just two hours after Google announced its breakthrough Gemini 1.5 Pro model, Sam Altman tweeted the reveal of Sora (text-to-video).
ChatGPT Enterprise vs. Google Cloud Next (Aug 2023): As Google began its major conference focused on selling AI to businesses, OpenAI announced ChatGPT Enterprise.
This is hilarious lol
In case you missed it. For example:
Nvidia's $100 billion OpenAI deal has seemingly vanished - Ars Technica
https://arstechnica.com/information-technology/2026/02/five-...
Specifically this paragraph is what I find hilarious.
> According to the report, the issue became apparent in OpenAI’s Codex, an AI code-generation tool. OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.
They should design their own hardware, then. Somehow the other companies seem to be able to produce fast-enough models.
Interesting
Need to keep the hype going if they are both IPO'ing later this year.
Consider the fact that 7-year-old TPUs are still sitting at near 100% utilization today.
Compute.
Google didn't announce $185 billion in capex to do cataloguing and flash cards.
What you can't do is pretend opencode is claude code to make use of that specific claude code subscription.
BTW, loser is spelled with a single o.
For downvoters, you must be naive to think these companies are not surveilling each other through various means.
Seems to be slower/thinks longer.
Also, there is no reason for OpenAI and Anthropic to be trying to one-up each other's releases on the same day. It is hell for the reader.
Looking at the Opus model card I see that they also have by far the highest score for a single model on ARC-AGI-2. I wonder why they didn't advertise that.
I'm firing 10 people now instead of 5!
I encourage people to try. You can even timebox it and come up with some simple things that might initially look insufficient; that discomfort is actually a sign that there's something there. It's very similar to going from having no unit/integration tests for design or regression to starting to have them.
While I love Codex and believe it's an amazing tool, I believe their preparedness framework is out of date. As it becomes more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come from more and more security-critical software being vibe coded.
It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and to improve on that scale.
In simpler terms: Codex should write secure software by default.
https://www.nbcnews.com/tech/tech-news/openai-releases-chatg...
I wonder if this will continue to be the case.
"We added some more ACLs and updated our regex"
GPT-5.3-Codex was so good it became my wife!
Meanwhile the prompt: Crop this photo of my passport
https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
I suppose coincidences happen too but that just seems too unlikely to believe honestly. Some sort of knowledge leakage does seem like the most likely reason.
| Name                | Score |
|---------------------|-------|
| OpenAI Codex 5.3    | 77.3  |
| Anthropic Opus 4.6  | 65.4  |

not saying there's a better way but both suck
With the right scaffolding these models are able to perform serious work at high quality levels.
Like, can the model take your plan and ask the right questions where there appear to be holes?
How broad an understanding does it have of the architecture and system design idioms around your language?
How does it choose among algorithms available in the language or common libraries?
How often does it hallucinate features/libraries that aren't there? (A minimal check for this is sketched after this list.)
How does it perform as context gets larger?
And that's for one particular language.
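As promised above, here's a minimal sketch of the hallucinated-library check in Python; the generated snippet is made up for illustration, and a real harness would also need to handle languages other than Python:

```python
# Minimal check: which top-level imports in generated code can't be
# resolved in the current environment? Unresolvable ones are candidate
# hallucinations (or just missing dependencies; a real harness must
# distinguish the two).
import ast
import importlib.util

def unresolved_imports(source: str) -> list[str]:
    """Return top-level imported module names that find_spec can't locate."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return [n for n in sorted(names) if importlib.util.find_spec(n) is None]

generated = "import json\nimport totally_made_up_lib\n"  # illustrative model output
print(unresolved_imports(generated))  # ['totally_made_up_lib']
```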
I’d feel unscientific and broken? Sure maybe why not.
But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.
Benchmarks are sometimes useful tools. But we are in prime Goodhart's Law territory.
> We are working to safely enable API access soon.
This week, I'm all local though, playing with opencode and running qwen3 coder next on my little spark machine. With the way these local models are progressing, I might move all my llm work locally.
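For anyone curious about the local setup: most local runtimes (llama.cpp's server, Ollama, vLLM) expose an OpenAI-compatible endpoint, so something like this sketch talks to the local model directly; the port and model name are assumptions to swap for your own setup.

```python
# Minimal sketch for hitting a locally served model through an
# OpenAI-compatible API. base_url and model are assumptions; adjust
# to match whatever your local runtime actually serves.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local servers usually ignore the key
resp = client.chat.completions.create(
    model="qwen3-coder",
    messages=[{"role": "user", "content": "Write a shell one-liner to count lines of Python in this repo."}],
)
print(resp.choices[0].message.content)
```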
they forgot to add “Can’t wait to see what you do with it”
> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training
I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.
Do we still think we'll have soft take off?
There's still no evidence we'll have any take off. At least in the "Foom!" sense of LLMs independently improving themselves iteratively to substantial new levels being reliably sustained over many generations.
To be clear, I think LLMs are valuable and will continue to significantly improve. But self-sustaining runaway positive feedback loops delivering exponential improvements resulting in leaps of tangible, real-world utility is a substantially different hypothesis. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.
It feels crazy to just say we might see a fundamental shift in 5 years.
But the current additions to compute, research, etc. definitely point in this direction, I think.
May I at least understand what it has "written"? AI help is good, but it shouldn't replace real programmers completely. I'm tired of copy-pasting code I don't understand. What if one day AI falls down and there are no real programmers left to write the software? AI as a helper is good, but I don't want AI to write whole files in my project; then something may break and I won't know what's broken. I've experienced it many times already. I told the AI to write something for me, and the code didn't work at all. It compiled normally, but the program was bugged. Or when I built a bigger project with ChatGPT only: it mostly worked, but over time, as I prompted for more and more things, everything broke.
What if you want to write something very complex now that most people don't understand? You keep offering more money until someone takes the time to learn it and accomplish it, or you give up.
I mean, there are still people that hammer out horseshoes over a hot fire. You can get anything you're willing to pay money for.
Anyone know if it is possible to use this model with opencode with the plus subscription?
GPT-5.3-Codex dominates terminal coding with a roughly 12-point lead on Terminal-Bench 2.0 (77.3 vs 65.4), while Opus 4.6 retains an 8-point edge in general computer use (OSWorld).
Anyone know the difference between OSWorld and OSWorld Verified?
I wish they would share the full conversations, token counts, and more. I'd like a better sense of how they normalize these comparisons across versions. Is this a 3-prompt, 10M-token game? A 30-prompt, 100M-token game? Are both models using similar prompts/token counts?
I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.
This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.
Can you guys point me to a single useful, majority LLM-written, preferably reliable program that solves a non-trivial problem that hasn't already been solved a bunch of times in publicly available code?
Some people just hate progress.
With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
that feels like a reflection of a real split in how people think llm-based coding should work...
some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result
Interested to see if we eventually see models optimize for those two philosophies and 3rd, 4th, 5th philosophies that will emerge in the coming years.
Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means
Dirty tricks and underhanded tactics will happen - I think Demis isn't savvy in this domain, but might end up stomping out the competition on pure performance.
Elon, Sam, and Dario know how to fight ugly and do the nasty political boardroom crap. 26 is gonna be a very dramatic year, lots of cinematic potential for the eventual AI biopics.
>Dirty tricks and underhanded tactics
As long as the tactics are legal (i.e., not corporate espionage, bribes, etc.), no-holds-barred free-market competition is the best thing for the market and for consumers.
Model costs continue to collapse while capability improves.
Competition is fantastic.