It sounds like it can handle simple tasks much more accurately, which is impressive to me. Today's coding agents tend to pretend they're working hard by generating lots of unnecessary code. I hope it's true.
SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%).
As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite it, missing crucial details. My approach would have been to just copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
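To make the "copy first, then patch" idea concrete, here is a minimal sketch of what I mean. It is not how Codex CLI works internally; `copy_then_patch` and the `replacements` mapping are hypothetical names I made up for illustration. The point is only that the file is copied verbatim, so nothing can be silently dropped, and then a small, explicit set of substitutions is applied.

```python
import shutil
from pathlib import Path


def copy_then_patch(src: Path, dst: Path, replacements: dict[str, str]) -> None:
    """Copy src to dst verbatim, then apply only the listed text substitutions.

    Hypothetical helper for illustration: everything not named in
    `replacements` is carried over unchanged from the original file.
    """
    # Create the target package directory if it does not exist yet.
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Verbatim copy (contents plus metadata); the source file is left in place.
    shutil.copy2(src, dst)
    # Apply the package-specific edits as plain string replacements.
    text = dst.read_text(encoding="utf-8")
    for old, new in replacements.items():
        text = text.replace(old, new)
    dst.write_text(text, encoding="utf-8")
```

For example, moving a module into a new package would mean passing a mapping from the old import prefix to the new one. Plain string replacement is crude, but the resulting diff is small and easy to review compared to a full rewrite.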
Additionally, they claim the new model is more steerable (both with AGENTS.md and generally). In my experience, Codex CLI w/gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!
[0]https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...
[1]https://github.com/openai/codex/blob/main/codex-rs/core/prom...
SWE-bench is a great eval, but it's very narrow. Two models can have the same SWE-bench scores but very different user experiences.
Here's a nice thread on X about the things that SWE-bench doesn't measure:
New Claude Sonnet 3.7 was a bit of a blunder, but overall, Anthropic has their marketing in tight order compared to OpenAI. Claude Code, Sonnet, Opus, those are great, clear differentiating names.
Codex, meanwhile, can mean anything from a code review service with GitHub integration to a series of dedicated models going back to 2021.
Also, while I do enjoy the ChatGPT app integration for quick on-the-go work made easier with a Clicks keyboard, I am getting more annoyed by the drift between Codex VSCode, Codex Website and Codex in the ChatGPT mobile app. The Website has a very helpful Ask button, which can also be used to launch subtasks via prompts written by the model, but no such button is present in the VSCode plugin, even though you can launch subtasks from the VSCode plugin if you have used Ask via the website first. Meanwhile, the iOS app has no Ask button and no subtask support, and neither the app nor the VSCode plugin shows remote work done beyond abbreviations, whereas the web page does show everything. Then there are the differences between local and remote via VSCode and the CLI, ... To people not using Codex, this must sound insane and barely understandable, but it seems that is the outcome of spreading yourself across so many fields: CLI, dedicated models, VSCode plugin, mobile app, code review, web page. Some, like Anthropic, only work on one or two, others, like Augment, on three, but no one else does that much, for better and worse.
I like using Codex, but it is a mess with massive potential, and it needs a dedicated team lead whose only focus is to untangle this mess before adding more features. Alternatively, maybe interview a few power users about their actual day-to-day experience, those who aren't just in one part of Codex but are using multiple or all of them. There is a lot of insight to be gained from someone who has an overview of the entire product stack, I think. Sending out a questionnaire to top users would be a good start; I'd definitely answer.
incomingpain•1h ago