The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from GPT-5.2-Codex's 64.7.
GPT-5.3-codex scores 77.3.
That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.
You can take off your tinfoil hat. The same models can perform differently depending on the programming language, the frameworks and libraries employed, and even the project. Also, context does matter, and a model's output varies greatly depending on your prompt history.
AI agents, perhaps? :-D
Every new model overfits to the latest overhyped benchmark.
Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.
It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
Looking forward to trying 5.3.
Hopefully performance will pick up after the rollout.
Cost to Run Artificial Analysis Intelligence Index:
GPT-5.2 Codex (xhigh): $3244
Claude Opus 4.5-reasoning: $1485
(and probably similar values for the newer models?)
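A quick back-of-the-envelope on those two figures (my own arithmetic, not an official comparison) puts the Codex run at roughly 2.2x the Opus cost:

```python
# Reported Artificial Analysis Intelligence Index run costs (USD)
gpt_52_codex_xhigh = 3244
opus_45_reasoning = 1485

# Codex run cost relative to Opus
ratio = gpt_52_codex_xhigh / opus_45_reasoning
print(f"{ratio:.2f}x")  # -> 2.18x
```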
I really do wonder what the chain here is. Did Sam see the Opus announcement and DM someone a minute later?
GPT-4o vs. Google I/O (May 2024): OpenAI scheduled its "Spring Update" exactly 24 hours before Google’s biggest event of the year, Google I/O. They launched GPT-4o voice mode.
Sora vs. Gemini 1.5 Pro (Feb 2024): Just two hours after Google announced its breakthrough Gemini 1.5 Pro model, Sam Altman tweeted the reveal of Sora (text-to-video).
ChatGPT Enterprise vs. Google Cloud Next (Aug 2023): As Google began its major conference focused on selling AI to businesses, OpenAI announced ChatGPT Enterprise.
This is hilarious lol
In case you missed it. For example:
Nvidia's $100 billion OpenAI deal has seemingly vanished - Ars Technica
https://arstechnica.com/information-technology/2026/02/five-...
Specifically this paragraph is what I find hilarious.
> According to the report, the issue became apparent in OpenAI’s Codex, an AI code-generation tool. OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.
They should design their own hardware, then. Somehow the other companies seem to be able to produce fast-enough models.
Interesting
Need to keep the hype going if they are both IPO'ing later this year.
Consider the fact that 7-year-old TPUs are still sitting at near 100% utilization today.
Compute.
Google didn't announce $185 billion in capex to do cataloguing and flash cards.
What you can't do is pretend opencode is claude code to make use of that specific claude code subscription.
BTW, loser is spelled with a single o.
For downvoters, you must be naive to think these companies are not surveilling each other through various means.
Seems to be slower/thinks longer.
Also, there is no reason for OpenAI and Anthropic to be trying to one-up each other's releases on the same day. It is hell for the reader.
Looking at the Opus model card I see that they also have by far the highest score for a single model on ARC-AGI-2. I wonder why they didn't advertise that.
I'm firing 10 people now instead of 5!
I encourage people to try. You can even timebox it and come up with some simple things that might look insufficient at first, but that discomfort is actually a sign that there's something there. Very similar to moving from having no unit/integration tests for design or regression to starting to have them.
While I love Codex and believe it's an amazing tool, I believe their preparedness framework is out of date. As it becomes more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come from having more and more security-critical software vibe coded.
It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's becoming more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and to improve on that scale.
In simpler terms: Codex should write secure software by default.
https://www.nbcnews.com/tech/tech-news/openai-releases-chatg...
I wonder if this will continue to be the case.
GPT-5.3-Codex was so good it became my wife!
Meanwhile the prompt: Crop this photo of my passport
https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
I suppose coincidences happen too but that just seems too unlikely to believe honestly. Some sort of knowledge leakage does seem like the most likely reason.
| Name                | Score |
|---------------------|-------|
| OpenAI Codex 5.3    | 77.3  |
| Anthropic Opus 4.6  | 65.4  |

not saying there's a better way but both suck
With the right scaffolding these models are able to perform serious work at high quality levels.
Like can the model take your plan and ask the right questions where there appear to be holes.
How broad an understanding does it have of architecture and system design around your language.
How does it choose to use algorithms available in the language or common libraries.
How often does it hallucinate features/libraries that aren't there.
How does it perform as context gets larger.
And that's for one particular language.
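One of those axes, hallucinated libraries, is easy to spot-check mechanically. A minimal sketch in Python (the function name and example snippet are my own, not from any benchmark): parse the generated code and verify that each imported module actually resolves in the target environment.

```python
import ast
import importlib.util


def hallucinated_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that
    cannot be resolved in the current environment."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue  # skip relative imports and non-import nodes
        for name in names:
            top = name.split(".")[0]
            if importlib.util.find_spec(top) is None:
                missing.append(top)
    return missing


# Example: the made-up module should be flagged, stdlib should not.
snippet = "import json\nimport totally_fake_lib\n"
print(hallucinated_imports(snippet))  # -> ['totally_fake_lib']
```

Running the same check across a batch of model outputs gives a crude hallucination rate per language or library ecosystem.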
I’d feel unscientific and broken? Sure maybe why not.
But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.
Benchmarks are sometimes useful, too. But we are in prime Goodhart's Law territory.
> We are working to safely enable API access soon.
This week, I'm all local though, playing with opencode and running qwen3 coder next on my little spark machine. With the way these local models are progressing, I might move all my llm work locally.
they forgot to add “Can’t wait to see what you do with it”
> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training
I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.
Do we still think we'll have soft take off?
May I at least understand what it has "written"? AI help is good, but don't replace real programmers completely. I've had enough of copy-pasting code I don't understand. What if one day AI falls down and there are no real programmers left to write the software? AI as a helper is good, but I don't want AI to write whole files into my project. Then something may break and I won't know what's broken. I've experienced it many times already: I told the AI to write something for me, and the code didn't work at all. It compiled normally, but the program was bugged. Or when I was building a bigger project with ChatGPT alone, it mostly worked, but over time, as I kept prompting for more and more things, everything got broken.
What if you want to write something very complex now that most people don't understand? You keep offering more money until someone takes the time to learn it and accomplish it, or you give up.
I mean, there are still people that hammer out horseshoes over a hot fire. You can get anything you're willing to pay money for.
Anyone know if it is possible to use this model with opencode with the plus subscription?
minimaxir•1h ago
crorella•1h ago
zozbot234•1h ago
DonHopkins•1h ago
hoeoek•1h ago
IhateAI•1h ago
observationist•1h ago
Dirty tricks and underhanded tactics will happen - I think Demis isn't savvy in this domain, but might end up stomping out the competition on pure performance.
Elon, Sam, and Dario know how to fight ugly and do the nasty political boardroom crap. 26 is gonna be a very dramatic year, lots of cinematic potential for the eventual AI biopics.
manquer•56m ago
>Dirty tricks and underhanded tactics
As long as the tactics are legal (i.e. not corporate espionage, bribes, etc.), no-holds-barred free market competition is the best thing for the market and the consumers.
thethimble•34m ago
Model costs continue to collapse while capability improves.
Competition is fantastic.
tedsanders•1h ago
manquer•58m ago
cedws•55m ago
thethimble•35m ago
pixl97•16m ago