It sounds like it can make simple tasks much more correct. It's impressive to me. Today coding agent tends to pretend they're working hard by generating lots of unnecessary code. Hope it's true
I'm happy with medium reasoning. My projects have been in Go, Typescript, React Dockerfiles stuff like that. The code almost always works, it's usually not "Clean code" though.
SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%).
As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite in (missing crucial or important details). My approach would have been to just the copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
Additionally, they claim the new model is more steerable (both with AGENTS.md and generally). In my experience, Codex CLI w/gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!
[0]https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...
[1]https://github.com/openai/codex/blob/main/codex-rs/core/prom...
GPT-5 may underwhelm with the same sparse prompt, as it seems to do exactly what's asked, not more
You can still "fully vibe" with GPT-5, but the pattern works better in two steps:
1. Plan (iterate on high-level spec/PRD, split into actions)
2. Build (work through plans)
Splitting the context here is important, as any LLM will perform worse as the context gets more polluted.
SWE-bench is a great eval, but it's very narrow. Two models can have the same SWE-bench scores but very different user experiences.
Here's a nice thread on X about the things that SWE-bench doesn't measure:
Prompting is different, but in a good way.
With Claude Code, you can use less prompting, and Claude will get token happy and expand on your request. Great for greenfield/vibing, bad for iterating on existing projects.
With Codex CLI, GPT-5 seems to handle instructions much more precisely. It won't just go off on it's own and do a bunch of work, it will do what you ask.
I've found that being more specific up-front gets better results with GPT-5, whereas with Claude, being more specific doesn't necessarily stop the eagerness of it's output.
As with all LLMs, you can't compare apples to oranges, so to clarify, my experiences are primarily with Typescript and Rust codebases.
It seems about half my sessions quickly become "why did you do that? rip __ out and just do ___". Then again, most of the other sessions involve Codex correctly inferring what I wanted without having to be so specific.
GPT5 + Codex CLI has been pretty productive for me. It's able to get a lot right in a simple prompt without getting too distracted with other crap. It's not perfect, but it's pretty good.
I actually worry GPT5-Codex will make it worse on that aspect though. One of the best parts of GPT5/Codex CLI is that it tends to plan and research first, then make code.
Claude Code has been a revelation and a bit of a let down the past 45 days.
Some open acknowledgement would have been great, but in lieu of it, it seems it's best to hop on a new tool and make sure you learn how to prompt better and not rely on the model to read between until usage is "optimized" and it no longer seems to work for those folks.
I've seen some interesting files that help any model understand a programming language as it's strong suit and it might not even be an expert in and how to best develop with it.
It's a shame because the plan is a great deal but the number of all caps and profanity laced messages I'm firing off at Claude is too damned high.
I am also bullying Claude more nowadays. Seeing this thread, I might give Codex another go (I was on Codex CLI before Claude Code. At that time, Claude blew Codex out of the water but something's changed)
It seems that the concept of file moving isn't something Codex (and other clis) handle well yet. (Same goes for removing. I've ~never seen success in tracking moves and removes in the git commit if I ask for one)
So it was part simplification (dedupe+consolidate), and part moving files around.
And sure, you can use an IDE, but that's harder to do if you live in vibe land. (We really need to understand that for some things, we have perfectly fine non-AI answers, but that's not the world as it is right now. Mechanical refactors, move + import fixes, autocomplete - all of those do not require an LLM. We're not great at drawing that line yet)
Beyond that, purely anecdotal and subjective, but this model does seem to do extensive refactors with semi precise step-by-step guidance a bit faster (comparing GPT-5 Thinking (Medium) and GPT-5 Codex (Medium)), though adherence to prompts seems roughly equivalent between the two as of now. In any case, I really feel they should consider a more nuanced naming convention.
New Claude Sonnet 3.7 was a bit of a blunder, but overall, Anthropic has their marketing in tight order compared to OpenAI. Claude Code, Sonnet, Opus, those are great, clear differentiating names.
Codex meanwhile can mean anything from a service for code reviews with Github integration to a series of dedicated models going back to 2021.
Also, while I do enjoy the ChatGPT app integration for quick on-the-go work made easier with a Clicks keyboard, I am getting more annoyed by the drift between Codex VSCode, Codex Website and Codex in the ChatGPT mobile app. The Website has a very helpful Ask button, which can also be used to launch subtasks via prompts written by the model, but such a button is not present in the VSCode plugin, despite subtasks being something you can launch from the VSCode plugin if you have used Ask via the website first. Meanwhile, the iOS app has no Ask button and no sub task support and neither the app, nor VSCode plugin show remote work done beyond abbreviations, whereas the web page does show everything. Then there are the differences between local and remote via VSCode and the CLI, ... To people not using Codex, this must sound insane and barely understandable, but it seems that is the outcome of spreading yourself across so many fields. CLI, dedicated models, VSCode plugin, mobile app, code review, web page, some like Anthropic only work on one or two, others like Augment three, but no one else does that much, for better and worse.
I like using Codex, but it is a mess with such massive potential that needs a dedicated team lead whose only focus is to untangle this mess, before adding more features. Alternatively, maybe interview a few power user on their actual day to day experience, those that aren't just in one, but are using multiple or all parts of Codex. There is a lot of insight to be gained from someone who has an overview off the entire product stack, I think. Sending out a questionnaire to top users would be a good start, I'd definitely answer.
Ffs...
Regardless of that, Codex needs to both come to Android for parity and the app features need to be expanded towards parity with the web page.
[0] https://searchengineland.com/report-75-percent-of-googles-mo...
It is inevitable.
And before someone says it, I do happen to have my own codex like environment complete with development containers, browser, github integration, etc.
And I'm happy to pay a mint for access to the best models.
>For developers using Codex CLI via API key, we plan to make GPT‑5-Codex available in the API soon.
Ditched my Claude code max sub for the ChatGPT pro $200 plan. So much faster, and not hit any limits yet.
The $200 pro feels good value personally.
What?
I also use claude or codex but i do not find them much useful for what i do.
For me it was (1) lack of MCP (2) excessive hand-holding required (3) availability of the Max plan meant I wouldn't have to monitor cost.
Claude Code's skill curve felt slightly longer than aider's, and had a higher ceiling.
I think if I were more cost-sensitive, I would give Aider another whirl, or Gemini CLI (based on what I've heard, I haven't used it yet).
If you ask it to plan something or suggest something. It'll write the suggestion and dive right into implementation with zero hesitation.
Both were struggling yesterday, with Claude being a bit ahead. Their biggest problems came with being "creative" (their solutions were pretty "stock"), and they had trouble making the simulation.
Tried the same problem on Codex today. The design it came up with still felt a bit lackluster, but it did _a lot_ better on the simulation.
LLM designed UIs will always look generic/stock if you don’t give it additional prompting because of how LLMs work - they’ve memorized certain design patterns and if you don’t specify what you want they will always default to a certain look.
Try adding additional UI instructions to your prompts. Tell it what color scheme you want, what design choices you prefer, etc. Or tell it to scan your existing app’s design and try to match it. Often the results will be much better this way.
I’m imagining if it can navigate the codebase and modify tests - like add new cases or break the tests by changing a few lines. This can actually verify if the tests were doing actual assertions and being useful.
Thorough reviewing like this probably benefits me the most - more than AI assisted development.
For people that have not tried it in say ~1 month, give Codex CLI a try.
My essentials for any coding agent are proper whitelists for allowed commands (you can run uv run <anything>, but rm requires approval every time) and customisable slash commands.
I can live without hooks and subagents.
This is a nearly impossible problem to solve.
uv run rm *
Sandboxing and limiting its blast radius is the only reliable solution. For more inspiration: https://gtfobins.github.io/
npm ERR! code 1 npm ERR! path /usr/local/lib/node_modules/@openai/codex/node_modules/@vscode/ripgrep npm ERR! command failed npm ERR! command sh -c node ./lib/postinstall.js npm ERR! /usr/local/lib/node_modules/@openai/codex/node_modules/@vscode/ripgrep/lib/download.js:199 npm ERR! zipFile?.close();
npm ERR! code 1
npm ERR! path /usr/local/lib/node_modules/@openai/codex/node_modules/@vscode/ripgrep
npm ERR! command failed
npm ERR! command sh -c node ./lib/postinstall.js
npm ERR! /usr/local/lib/node_modules/@openai/codex/node_modules/@vscode/ripgrep/lib/download.js:199
npm ERR! zipFile?.close();
npm ERR! ^
npm ERR!
npm ERR! SyntaxError: Unexpected token '.'Optional chaining was added in v14 (2020), and it sure looks like that is the issue here.
Poor code + doc hygiene is the problem here.
unset GITHUB_TOKEN
npm install --global @openai/codex
Granted, it's not an apples-to-apples comparison since Codex has the advantage of working in a fully scaffolded codebase where it only has to paint by numbers, but my overall experience has been significantly better since switching.
2) ask it to implement plan
That's the way to work with Claude.
Other systems don't have a bespoke "planning" mode and there you need to "tune your input prompt" as they just rush in to implementation by guessing what you wanted
The Codex support at the moment requires adding a comment "@codex review" which then initiates a cloud based review.
You can, however, directly invoke Codex CLI from a GitHub workflow to do things like perform a code review.
On high it is totally unusable.
It is super annoying that it either vibe codes and just edits and use tools, or it has a plan mode, but no in-between where it asks me whether it's fine it does a or b.
I'm not understanding why it lacks such a capability, why in the world would I want to choose between having to copy paste the edits or auto accept them by default...
I want it to help me come up with a plan, execute and check and edit every single edit but with the UX offered by claude, Codex is simply atrocious, I regret spending 23 euros on this.
I see the visual studio code extension does offer something like this, but the UX/UI is terrible, doesn't OAI have people testing those things?
The code is unreadable in that small window[1], doesn't show the lines above/below, it doesn't have IDE tooling (can't inspect types e.g.).
https://i.imgur.com/mfPpMlI.png
This is just not good, that's the kind of AI that slows me, doesn't help at all.
It has a very specific style and if your project isn't in that style, it starts to enforce it -> "making a mess".
I have an extensive array of tripwires, provenance chain verifications and variance checks in my code, and I have to treat Claude as adversarial when I let it touch my research. Not a great sign.
I don't have a test like that on hand so I'm really curious what all you prompted the model, what it suggested, and how much your knowledge as a SWE enabled that workflow.
I'd like a more concrete understanding if the mind blowing nature is attainable for any average SWE, an average Joe that tinkers, or only a top decile engineer.
I chose to turn on Cursor's pay per usage within the Pro plan (so I paid $25, $20+$5 usage, instead of upgrading to $60/m) in order to keep using Claude because it's faster than Grok
I just don't see how this can be considered productive - I am waiting 20 minutes staring at it "thinking" while it does trivial tasks on a virtually bare repo. I guess for async agents its not such a big deal if they are slow as molasses, as you can run dozens of them, but you need a structured codebase for that - and i am already hours in and haven't even gotten a skeleton.
I have read through all the docs, watched the video. It would be so much quicker just to write this code myself. What am I doing wrong? Is it just super slow because they are over capacity or is this just the current state of the art?
There's even an article I read about this the other week, but I can't seem to find it ATM.
The challenge is that you have to be working on multiple work streams at once because so far Codex isn't great at not doing work you are doing in another task even if you tell it something like "class X will have a function that returns y"...it will go write that function most times.
I've found it really good for integration work between frontend and backend features where you can iterate on both simultaneously if the code isn't in the same codebase.
Also, for Codex this works best in the web ui because it actually uses branches, opens prs, etc. I think (though could be wrong) that locally with the CLI or IDE extension you might have to manually great git worktrees, etc.
If I ever find my self just waiting, then it always gives me an opportunity to respond to messages, emails, or update tickets. Won't be long now until the agents are doing that as well...
When i use that approach I end up merging one PR, then have to prod the others to fix themselves - resolving conflicts, removing duplicate code, etc - so it ends up slower than just running one agent at a time.
Like i said - maybe this is a problem on a bare repo? But if so, how are people vibe coding from scratch and calling themselves productive? I just don't get it.
incomingpain•4mo ago
NitpickLawyer•4mo ago
incomingpain•4mo ago
NitpickLawyer•4mo ago
I have it go to openrouter, and then you just export the API key and run codex, works smoothly.