I have a pretty complex project, so I need to keep an eye on it to ensure it doesn't go off the rails and delete all the code to get a build to pass (it wouldn't be the first time).
In fact, you should at least convert your code's tabs to spaces before the LLM sees it. It'll improve your results by making the input look more like the model's training data.
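A minimal Python sketch of that preprocessing step, assuming UTF-8 source files and a 4-space tab width:

```python
from pathlib import Path

def detab_source(path: str, tab_width: int = 4) -> str:
    """Expand tabs to spaces so the text looks more like the
    (mostly space-indented) code the model was trained on."""
    text = Path(path).read_text(encoding="utf-8")
    return text.expandtabs(tab_width)

# Hypothetical file; run this before the contents go into the prompt.
prompt_context = detab_source("src/main.py")
```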
> Reason #3a: Work with the model biases, not against
Another note on model biases: you should lean into them. The tricky part is that the only way to figure out a model's defaults is actual usage with careful monitoring (or evals that let you spot them).
Instead of forcing the model to behave in ways it ignores, adapt your prompts and post-processing to embrace its defaults. You'll save tokens and get better results.
If the model keeps hallucinating some JSON fields, maybe you should support (or even encourage) those fields instead of trying to prompt the model against them.
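To make "support the hallucinated fields" concrete, here's a minimal Python sketch; the field names (`headline`, `description`) are hypothetical stand-ins for whatever your model keeps inventing:

```python
import json

EXPECTED = {"title", "summary"}
# Keys the model keeps emitting no matter how we prompt against them.
# Instead of fighting it, map them onto the fields we actually use.
ALIASES = {"headline": "title", "description": "summary"}

def parse_with_the_grain(raw: str) -> dict:
    data = json.loads(raw)
    result = {}
    for key, value in data.items():
        if key in EXPECTED:
            result[key] = value
        elif key in ALIASES:
            result.setdefault(ALIASES[key], value)
        # Anything else is dropped rather than failing the whole response.
    return result

print(parse_with_the_grain('{"headline": "Hi", "summary": "...", "mood": "upbeat"}'))
# {'title': 'Hi', 'summary': '...'}
```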
Cut wood with the grain, not against it.
I don't like that at all. Actually running the code is the single most effective protection we have against coding mistakes, from both humans and machines.
I think it's absolutely worth the complexity and performance overhead of hooking up a real container environment.
Not to mention you can run a useful code execution container in 100MB of RAM on a single CPU (or slice thereof). Simulating that with an LLM takes at least one GPU and 100GB or more of VRAM.
No, I didn’t know running containers used “virtually no overhead.” It appears I can run millions of containers without any resource constraint? Is that some sort of cheat code?
But when I installed Codex and tried to make a simple code bugfix, I got rate limited nearly immediately. As in, after 3 "steps" the agent took.
Are you meant to only use Codex with their $200 "unlimited" plans? Thanks!
gpt-5-2025-08-07
38.887K input tokens
That was my usage, and I got rate limited. Thank you for your tips!

OpenAI say API access to that model is coming soon, at which point I'll be able to use it in Codex CLI with an API key and pay for tokens as you go.
You can also use the Codex CLI tool without using the new GPT-5-Codex model.
I was tempted to give Codex a try but a colleague was stung by their pricing. Apparently if you go over your Pro plan allocation, they just quietly and automatically start billing you per-token?
https://cdn.openai.com/pdf/97cc5669-7a25-4e63-b15f-5fd5bdc4d...
SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (via internal refactor benchmark 33.9% -> 51.3%).
As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite it (missing crucial or important details). My approach would have been to just copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
Additionally, they claim the new model is more steerable (both with AGENTS.md and generally).
In my experience, Codex CLI w/gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!
[0]https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...
[1]https://github.com/openai/codex/blob/main/codex-rs/core/prom...
(comment reposted from other thread)
What worked was getting it to first write a detailed implementation plan for a “junior contractor”, then having it attempt the plan in phases (clearing the task window each time), with instructions to copy files to /tmp, transform them there, and then update the originals.
Looking forward to trying the new model out on the next refactor!
Will try adding the instructions specific to refactors (i.e. copy/move files, don't rewrite when possible)
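A minimal Python sketch of that copy-to-/tmp pattern, with hypothetical paths and a placeholder transform:

```python
import shutil
from pathlib import Path

def refactor_via_copy(src: str, transform) -> None:
    """Copy src to /tmp, transform the copy, then move the result
    back, so the original is never deleted mid-rewrite."""
    work = Path("/tmp") / Path(src).name
    shutil.copy2(src, work)                       # full copy first
    work.write_text(transform(work.read_text()))
    shutil.move(str(work), src)                   # replace the original

# Hypothetical package move: only the import paths change.
refactor_via_copy(
    "libs/auth/client.py",
    lambda text: text.replace("from internal.auth", "from auth_pkg"),
)
```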
I've also found it helpful, especially for certain regressions, to basically create a new branch for any Codex/CC assisted task (even if part of a larger task). Makes it easier to identify regressions due to recent changes (i.e. look at git diff, it worked previously)
Telling the "agent" to manage git leads to more context pollution than I want, so I manage all commits/branches myself, but I'm sure that will change as the tools improve/they do more RL on full-cycle software dev
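For what it's worth, that manual branch-per-task routine is easy to script; here's a sketch in plain git (wrapped in Python, with a `codex/` branch prefix that is just an assumed convention):

```python
import subprocess

def start_agent_task(name: str) -> None:
    """Create an isolated branch before the agent touches anything,
    so regressions are confined to one reviewable diff."""
    subprocess.run(["git", "switch", "-c", f"codex/{name}"], check=True)

def review_agent_task() -> str:
    """Diff against main to see exactly what the task changed."""
    result = subprocess.run(
        ["git", "diff", "main...HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

start_agent_task("extract-auth-package")
# ... let the agent work, then inspect:
print(review_agent_task())
```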
This model is available inside all OpenAI Codex products, but it's not yet available via the API.
The model is supposed to be better at code reviews and comments than the other GPT-5 variants. It can also think/work for up to 7 hours.
Even shorter version:
- New coding-specialist model called GPT-5-Codex, coming soon to the API but for now available in their Codex CLI, VS Code and Codex Cloud products
- New code review product (part of Codex Cloud) that can review PRs for you
- New model promises better code review, fewer pointless comments, and the ability to vary its reasoning effort for simple vs complex tasks
So it's somewhat in line with what Theo mentioned in his video, that he was not happy with the UI capabilities.
Updates (v0.36) https://github.com/openai/codex/releases
Commented after I saw this added in today’s release notes: “initial MCP interface and docs”
Had to spend quite a long time figuring out a dependency error...
I can't use Codex in the IDE at all now, it seems.
> OpenAI
> We rely on many of OpenAI's models to give AI responses. Requests may be sent to OpenAI even if you have an Anthropic (or someone else's) model selected in chat (e.g. for summarization)*. We have a zero data retention agreement with OpenAI.
Source: https://cursor.com/security
I will say that the Security page by the Cursor team is a very nice overview, even going into Auth, etc., and I applaud that, but I see nothing here that differentiates their use of e.g. OpenAI models from the agreements OpenAI offers themselves. Essentially, I don't see why anyone would have such severely heightened trust in Cursor over competitors in this area. If they only provided self-hosted models, I could understand it, but not the way they operate.
Personally, both because of what LLMs have been trained on (and how), and because of my expectations around privacy regardless of model-provider assurances, I'd treat any LLM-derived/assisted/reviewed code as public the second you send it to a provider-hosted model, and as some form of FOSS to boot. Basically, if you used Cursor, Codex, Augment or anything of that sort, I'd reduce any future privacy expectations straight away; you might as well put it on public GitHub for everyone to see.
Only self-hosting on-prem is an option for keeping control of your codebase, though personally I'd still consider licensing such code as FOSS, considering every model was trained on EUPL-, GPL-, etc.-licensed code. That's a personal opinion (very much philosophical and not at all legal, as it gets into arguments about what training is, what weights are, and so on, which can go on forever), but I'd argue that whether you are MSFT or a small startup, if you derive a significant amount of new code from LLMs, it isn't reasonable to claim copyleft shouldn't be at the very least on the mind of your legal department. Of course, this will have to be decided by courts, and likely in favour of those with the best legal teams. I doubt that even if the "80% of our code is written by LLMs" claims were true, that would convince a court to enforce copyleft on the product in question, but personally, that'd be my viewpoint.
Regardless of licensing, if you send your code to Cursor, purely privacy wise, you shouldn't have reservations about OpenAI.
I've just started out trying out Claude Code and am not sure how Codex compares on React projects.
From my initial usage, it seems Claude Code's planning mode is superior to its normal(?) mode, and giving it an overall direction to proceed, rather than just stating a desired feature, seems to produce better results. It also does better if a large task is split into very small sub-tasks.
The main issues with Codex now seem to be the very poor stability (it seems to be down almost 50% of the time) and lack of custom containers. Hoping those get solved soon, particularly the stability.
I also wonder where the price will end up, it currently seems unsustainably cheap.
JetBrains has a $30/mo subscription (with a GPT-5 backend) and the quota burns fast.
Assuming JetBrains prices at breakeven, either OpenAI has some secret sauce or they're losing money on Codex.
EnPissant•4mo ago
- The smartest model I have used. Solves problems better than Opus-4.1.
- It can be lazy. With Claude Code / Opus, once given a problem, it will generally work until completion. Codex will often perform only the first few steps and then ask if I want to continue to do the rest. It does this even if I tell it to not stop until completion.
- I have seen severe degradation near max context. For example, I have seen it just repeat the next steps every time I tell it to continue and I have to manually compact.
I'm not sure if the problems are GPT-5 or Codex. I suspect a better Codex could resolve them.
brookst•4mo ago
Very frustrating, and happening more often.
Jcampuzano2•4mo ago
But they have suffered quite a lot of degradation and quality issues recently.
To be honest unless Anthropic does something very impactful sometime soon I think they're losing their moat they had with developers as more and more jump to codex and other tools. They kind of massively threw their lead imo.
apigalore•4mo ago
With GPT‑5-Codex they do write: "During testing, we've seen GPT‑5-Codex work independently for more than 7 hours at a time on large, complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation." https://openai.com/index/introducing-upgrades-to-codex/
naiv•4mo ago
Have to say I'm not sure what this even means, or what the exact definition of a "message" is in this context.
With Claude Code Max20 I was constantly hitting limits; with Codex, not once yet.
mike_hearn•4mo ago
GPT-5 is a great model. I tried the Rust version of Codex CLI, as they seem to be deprecating the JS version, and it is awful. I don't know what possessed them to try and write a TUI in Rust, but it isn't working. The Claude Code UI is hugely superior.
Jcampuzano2•4mo ago
Everyone else slowly caught up and/or surpassed them while they simultaneously had quality control issues and service degradation plaguing their system - ALL while having the most expensive models comparatively in terms of intelligence.
bjackman•4mo ago
My experience after a month or so of heavy use is exactly this. The AI is rock solid. I'm pretty consistently impressed with its ability to derive insights from the code, when it works. But the client is flaky, the backend is flaky, and the overall experience for me is always "I wish I could just use Claude".
Say 1 in 10 queries craps out (often the client OOMs even though I have 192GB of RAM). Sounds like a 10% reliability issue, but actually it just pushes me into "fuck this, I'll just do it myself" mode, so it knocks out like 50% of the value of the product.
(Still, I wouldn't be surprised if this can be fixed over the next few months, it could easily be very competitive IMO).
bjackman•4mo ago
None of these tools give the impression of being well-tested software. My guess is that neither OpenAI nor Anthropic actually has the necessary density in expertise to build quality software. Google obviously can build good software _when it really wants to_ but in this space its strategy looks like "build the products the other guys are building, cut whatever corners necessary to do this absolutely as fast as possible".
So even if my initial impressions are more accurate it's quite possible Google wins long term here.
dumpsterdiver•4mo ago
Remember the whole “Taken 3 makes Taken 2 look like Taken 1” meme? Well Google’s latest video generating AI makes any video gen AI I’ve seen up until now look like Taken 3* (sigh, I said 1, ruined it) - and they are seriously impressive on their own.
Edit: By “they” I mean the other video-generating AI models, not the other Taken movies. I hope Liam Neeson doesn't read HN, because a delivery like that might not make him laugh.
echelon•4mo ago
Antitrust enforcement has been letting us down for over two decades. If we don't have an oxygenation event, we'll go an entire generation where we only reward tax-collecting, non-innovation capital. That's unhealthy and unfair.
Our career sector has been institutionalized and rewards the 0.001% even as they rest on their laurels and conspire to suppress wages and innovation. There's a reason why centicorns petered out and why the F500 is tech-heavy. It's because big tech is a dragnet that consumes everything it touches - film studios, grocery stores, and God only knows what else it'll assimilate in the unending search for unregulated, cancerous growth.
FAANG's $500k TC comes at the expense of hundreds of unicorns that would make their ICs even wealthier. That money mostly winds up parked with institutional investors instead of flowing into high-stakes risks and cutthroat competition. That's why a16z and YC want to see increased antitrust regulation.
But it's really bad for consumers too. It's why our smartphones are stagnant, tax-collecting banana republics with one of two landlords. Nothing new, yet as tightly controlled as an authoritarian state. New ideas can't be tried and can't attain healthy margins.
It's wild that you can own a trademark, but the only way for a consumer to access it is to use a Google browser that defaults to Google search (URLs are scary), where the search results will be gamed by competitors. You can't even own your own brand anymore.
Winning shouldn't be easy. It should be hard. A neverending struggle that rewards consumers.
We need a forest fire to renew the ecosystem.
andai•4mo ago
- all the users
- all the apps (Google, GMail, YouTube, Docs, Maps...)
- all the books (Google Books)
- all the video (YouTube)
- all the web pages
- custom hardware
It's honestly weird they aren't doing better. Agree that the models are great and the UX is bad all around.
LordDragonfang•4mo ago
Unfortunately, they've been insulated from the consequences of their bad decisions by the fact the money printer (ads) keeps their company afloat and mollifies shareholders. The moment that dries up, they're in trouble.
echelon•4mo ago
I don't think they care what we think. They're thriving despite our protests.
But yeah, they shouldn't be shielded from antitrust. They have literally everything.
brianjking•4mo ago
- all the lobbyists
- all the money
epolanski•4mo ago
It's super annoying that it doesn't provide a way to approve edits one by one; instead it either vibe-codes on its own or gives me diffs to copy-paste.
Claude Code has a much saner "normal mode".
epolanski•4mo ago
Is it via the CLI? Via an extension to an editor? What is your flow?
robotswantdata•4mo ago
Gemini CLI is too inconsistent; it's good for documentation tasks, but don't let it write code for you.
brianjking•4mo ago
Or we're all just used to eating things we don't like and smiling.
vitorgrs•4mo ago
Not sure the fault is "writing bad code"; I guess it's just not good at being agentic. I saw this with Gemini CLI and other tools.
GLM, Kimi, and Qwen-Code all behave better for me.
Gemini 3 will probably fix this, as Gemini 2.5 Pro is "old" by now.
troupo•4mo ago
Claude Code does that on longer tasks.
Time to give Codex a try I guess.