I have a pretty complex project, so I need to keep an eye on it to ensure it doesn't go off the rails and delete all the code to get a build to pass (it wouldn't be the first time).
In fact, you should convert your code's indentation to spaces, at least before the LLM sees it. It'll improve your results by making the code look more like the model's training data.
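Something like this is all it takes (untested Python sketch; the 4-space tab width is an assumption, adjust to your codebase):

```python
# Normalize indentation before the prompt is built:
# tabs become spaces so the source better matches training data.
def normalize_for_llm(source: str, tab_width: int = 4) -> str:
    return source.expandtabs(tab_width)
```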
> Reason #3a: Work with the model biases, not against
Another note on model biases: you should lean into them. The tricky part is that the only way to figure out a model's defaults is actual usage and careful monitoring (or evals that let you spot them).
Instead of forcing the model to behave in ways it ignores, adapt your prompts and post-processing to embrace its defaults. You'll save tokens and get better results.
If the model keeps hallucinating some JSON fields, maybe you should support (or even encourage) those fields instead of trying to prompt the model against them.
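For example, a minimal sketch with Pydantic; the schema and the `confidence` field here are hypothetical stand-ins for whatever your model keeps emitting:

```python
from pydantic import BaseModel, ConfigDict

class ToolCall(BaseModel):
    model_config = ConfigDict(extra="allow")  # keep fields we didn't declare

    name: str
    arguments: dict

# The model insists on adding "confidence" even when told not to,
# so accept it instead of failing validation:
raw = '{"name": "search", "arguments": {"q": "codex"}, "confidence": 0.9}'
call = ToolCall.model_validate_json(raw)
print(call.name, call.model_extra)  # extras survive, e.g. {'confidence': 0.9}
```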
I don't like that at all. Actually running the code is the single most effective protection we have against coding mistakes, from both humans and machines.
I think it's absolutely worth the complexity and performance overhead of hooking up a real container environment.
Not to mention you can run a useful code execution container in 100MB of RAM on a single CPU (or slice thereof). Simulating that with an LLM takes at least one GPU and 100GB or more of VRAM.
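For a sense of scale, here's a sketch of that kind of sandbox (assumes Docker is installed; the image and resource limits are illustrative):

```python
import subprocess

def run_sandboxed(code: str, timeout: int = 30) -> subprocess.CompletedProcess:
    # ~100MB of RAM, half a CPU, no network: enough to actually execute
    # model-written Python and see whether it runs.
    return subprocess.run(
        ["docker", "run", "--rm", "--network=none",
         "--memory=100m", "--cpus=0.5",
         "python:3.12-slim", "python", "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )

print(run_sandboxed("print(2 + 2)").stdout)  # "4" if the code really ran
```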
But when I installed Codex and tried to make a simple code bugfix, I got rate limited nearly immediately. As in, after 3 "steps" the agent took.
Are you meant to only use Codex with their $200 "unlimited" plans? Thanks!
OpenAI say API access to that model is coming soon, at which point you'll be able to use it in Codex CLI with an API key and pay for tokens as you go.
You can also use the Codex CLI tool without using the new GPT-5-Codex model.
I was tempted to give Codex a try but a colleague was stung by their pricing. Apparently if you go over your Pro plan allocation, they just quietly and automatically start billing you per-token?
https://cdn.openai.com/pdf/97cc5669-7a25-4e63-b15f-5fd5bdc4d...
SWE-bench performance is similar to normal GPT-5, so it seems the main delta with `gpt-5-codex` is on code refactors (per their internal refactor benchmark: 33.9% -> 51.3%).
As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite it, dropping crucial details along the way. My approach would have been to just copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
Additionally, they claim the new model is more steerable (both with AGENTS.md and generally).
In my experience, Codex CLI w/gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!
[0]https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...
[1]https://github.com/openai/codex/blob/main/codex-rs/core/prom...
(comment reposted from other thread)
This model is available inside all OpenAI Codex products. It's not yet available via the API.
The model is supposed to be better at code reviews and comments than the other GPT-5 variants. It can also think/work for up to 7 hours.
I had to spend quite a long time figuring out a dependency error...
EnPissant•1h ago
- The smartest model I have used. Solves problems better than Opus 4.1.
- It can be lazy. With Claude Code / Opus, once given a problem, it will generally work until completion. Codex will often perform only the first few steps and then ask if I want to continue to do the rest. It does this even if I tell it to not stop until completion.
- I have seen severe degradation near max context. For example, I have seen it just repeat the next steps every time I tell it to continue and I have to manually compact.
I'm not sure if the problems are GPT-5 or Codex. I suspect a better Codex could resolve them.
brookst•1h ago
Very frustrating, and happening more often.
Jcampuzano2•1h ago
But they have suffered quite a lot of degradation and quality issues recently.
To be honest, unless Anthropic does something very impactful soon, I think they're losing the moat they had with developers as more and more jump to Codex and other tools. They kind of massively threw away their lead, imo.
Jcampuzano2•1h ago
Everyone else slowly caught up and/or surpassed them while quality-control issues and service degradation plagued their system, ALL while they had the most expensive models relative to their intelligence.
bjackman•45m ago
My experience after a month or so of heavy use is exactly this. The AI is rock solid. I'm pretty consistently impressed with its ability to derive insights from the code, when it works. But the client is flaky, the backend is flaky, and the overall experience for me is always "I wish I could just use Claude".
Say 1 in 10 queries craps out (often the client OOMs even though I have 192GB of RAM). That sounds like a 10% reliability issue, but in practice it pushes me into "fuck this, I'll just do it myself" territory, so it knocks out more like 50% of the product's value.
(Still, I wouldn't be surprised if this gets fixed over the next few months; it could easily be very competitive IMO.)
robotswantdata•5m ago
Gemini CLI is too inconsistent; it's good for documentation tasks, but don't let it write code for you.