Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (i.e. analyse this code snippet and provide security feedback / improvements).
At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (i.e. our gpt666-pro-ultra-krypto-sec found a CVE in openBSD stable release), while not being exposed to tabloid style titles like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...
> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.
I'm curious what they mean by the dual-use risks.
Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.
Just safety nerds being gatekeepers.
They did the same thing for gpt-5.1-codex-max (code name “arcticfox”), delaying its availability in the API and only allowing it to be used by monthly plan users, and as an API user I found it very annoying.
I find it pattern-matches incorrectly with a very narrow focus, and it will ignore real, documented differences even when they're explicitly highlighted in the prompt text ("this is X CRDT algo, not Y CRDT algo").
I've canceled my subscription. On any larger edit it will just start wrecking nuance and then refuse to accept prompts that point this out, which is an extremely dangerous form of target fixation.
On top of that, the Codex CLI team is responsive on github and it's clear that user complaints make their way to the team responsible for fine tuning these models.
I run bake-offs between all three models, and GPT 5.2 generally has a higher success rate at implementing features, followed closely by Opus 4.5 and then Gemini 3, which has trouble with agentic coding. I'm interested to see how 5.2-codex behaves. I haven't been a fan of the codex models in general.
(Also, I can't imagine who is blessed with so much spare time that they would look down on an assistant that does decent work)
Yeah, it feels really strange sometimes. Bumping up against something that Codex seemingly can't work out, and you give it to Claude and suddenly it's easy. And you continue with Claude and eventually it gets stuck on something, and you try Codex which gets it immediately. My guess would be that the training data differs just enough for it to have an impact.
But if you want that last 10%, codex is vital.
Edit: Literally right after I typed this, it happened. Codex 5.2 reports a P1 bug in a PR. I look closely; I'm not actually sure it's a "bug". I take it to Claude. Claude agrees it's more of a product-behavior opinion about whether or not to persist garbage data, and offers its own product opinion that I probably want to keep it the way it is. Codex 5.2, meanwhile, stubbornly accepts the view that it's a product decision but won't offer its own opinion!
It's because performance degrades over longer conversations, which decreases the chance that the same conversation will result in a solution, and increases the chance that a new one will. I suspect you would get the same result even if you didn't switch to a different model.
They just have different strengths and weaknesses.
since nobody (other than that paper) has been trying to measure output, everything is based on feelings and fashion, like you say.
I'm still raw dogging my code. I'll start using these tools when someone can measure the increase in output. Leadership at work is beginning to claim they can, so maybe the writing is on the wall for me. They haven't shown their methodology for what they're measuring, just told everyone they "can tell".
But until then, I can spot too many psychological biases inherent in their use to trust my own judgement, especially when the only real study done so far on this subject shows that our intuition lies about this.
And in the meantime, I've already lost time investigating reasonable looking open source projects that turned out to be 1) vibe coded and 2) fully non functional even in the most trivial use. I'm so sick of it. I need a new career
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.
I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?
Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly
So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.
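A quick illustration of that offset (the numbers below are made up for the example, not real pricing): per-task cost is price per token times tokens used, so a pricier model that emits far fewer tokens can come out about even.

```python
def task_cost(price_per_mtok: float, output_tokens: int) -> float:
    """Cost of one task: tokens used times price per million tokens."""
    return price_per_mtok * output_tokens / 1_000_000

# Model A: higher price per token, but terse outputs (hypothetical numbers).
cost_a = task_cost(price_per_mtok=10.0, output_tokens=40_000)
# Model B: cheaper tokens, but much chattier reasoning traces.
cost_b = task_cost(price_per_mtok=3.0, output_tokens=150_000)

print(f"A: ${cost_a:.2f}, B: ${cost_b:.2f}")  # A: $0.40, B: $0.45
```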
See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here
https://artificialanalysis.ai/
...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!
Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.
I basically use it to drive Claude Code, which will nuke the codebase with abandon.
(unrelated, but piggybacking on requests to reach the teams)
If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.
Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.
OpenAI kicks ass at this (but could do better with style controls: if I give a Midjourney style ref, use it):
https://imgur.com/gallery/previz-to-image-gpt-image-1-x8t1ij...
https://imgur.com/a/previz-to-image-gpt-image-1-5-3fq042U
Google fails the tests currently, but can probably easily catch up:
This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".
I use it because it works out cheaper than Codex Cloud and gives you greater flexibility. Although it doesn't have 5.2-codex yet.
Then again, I wouldn't put much trust into OpenAI's handling of information either way.
Unsure where that could be if you're using Windows.
You know what would be fun to try? Give Codex full access and then ask it to delete that folder, lol.
My only gripe is I wish they'd publish Codex CLI updates to homebrew the same time as npm :)
Claude still tends to add "fluff" around the solution and over-engineer. It's not that the code doesn't work; it's just ugly.
> "In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety."
Scary good.
But the good ones are not open. It's not even a matter of money. I know at OpenAI they are invite only for instance. Pretty sure there's vetting and tracking going on behind those invites.
Devstral Small 2 Instruct running locally seems about as capable, with the upside that when it's wrong, it's very obvious instead of covered in bullshit.
What about 2 weeks before Christmas?
One surprising thing that Codex helped with is procrastination. I'm sure many people have had this feeling: you have some big task and you don't quite know where to start. Just send it to Codex. It might not get it right, but it's almost always a good starting point that you can quickly iterate on.
Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results? Not snark by the way, I’m asking in earnest how you pick one model over another.
This is what I do. I have a little TUI that fires off Claude Code, Codex, Gemini, Qwen Coder and AMP in separate containers for most tasks I do (although I've started to use AMP less and less), and it returns the last message of what they replied and/or a git diff of what exactly they did. Then I compare them side by side. If all of them got something wrong, I update the prompt and fire them off again. Always start from zero, and always include the full context of what you're doing in the first message; they're all non-interactive sessions.
Sometimes I do 3x Codex instead of different agents, just to double-check that all of them would do the same thing. If they go off and do different things from each other, I know the initial prompt isn't specific/strict enough, and again iterate.
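A rough sketch of that kind of fan-out, assuming each agent CLI can be run non-interactively (the `AGENTS` command lines below are placeholders, not the real flags; the actual tools each have their own invocation):

```python
import pathlib, shutil, subprocess, tempfile
from typing import Callable, Optional

# Placeholder non-interactive invocations; real CLIs have their own flags.
AGENTS = {
    "claude": ["claude", "-p"],
    "codex": ["codex", "exec"],
    "gemini": ["gemini", "-p"],
}

def run_bakeoff(repo: str, prompt: str,
                run: Optional[Callable[[list, str], None]] = None) -> dict:
    """Give the same prompt to each agent in its own clean working copy
    of `repo`, then return a mapping of agent name -> `git diff` output."""
    def default_run(cmd: list, cwd: str) -> None:
        subprocess.run(cmd, cwd=cwd, check=False)  # agent edits files in place
    run = run or default_run

    diffs = {}
    for name, cmd in AGENTS.items():
        work = tempfile.mkdtemp(prefix=f"bakeoff-{name}-")
        shutil.copytree(repo, work, dirs_exist_ok=True)  # includes .git
        run(cmd + [prompt], work)
        out = subprocess.run(["git", "diff"], cwd=work,
                             capture_output=True, text=True, check=False)
        diffs[name] = out.stdout  # compare these side by side
    return diffs
```

Each run starts from the same committed state, so the diffs are directly comparable, and a too-vague prompt shows up immediately as the agents diverging from each other.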
Honestly, I'd love to try that. My Gmail username is the same as my HN username.
Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix despite my prompt saying “don’t write code, simply investigate.”
I run these tests regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably.
Languages: JavaScript, Elixir, Python
Same experience, both with procrastination from not knowing where to start, and with getting stuck in the middle and not knowing where to go. That literally never happens anymore. You have discussions with it to do the planning and weigh different implementation options, and by the end you have a good design description. At that point, what's the point of writing the code yourself, when with that design it will write it quickly and match what was agreed?
(here I am remembering a time I had no computer and would program data structures in OCaml with pen and paper, then would go to university the next day to try it. Often times it worked the first try)
Especially in the CLI, it seems so eager to start writing code that nothing can stop it, not even the best Agents.md.
Asking it a question or telling it to check something doesn't mean it should start editing code; it means answer the question. All models have this issue to some degree, but codex is the worst offender for me.
I feel the same. CodexTheModel (why have two things named the same way?!) is a good deal faster than the other models, and probably on the "fast/accuracy" scale it sits somewhere else, but most code I want to be as high quality as possible, and the base models do seem better at that than CodexTheModel.
I see people gushing over these codex models but they seem worse than the big gpt models in my own actual use (i.e. I'll give the same prompt to gpt-5.1 and gpt-5.1-codex and codex will give me functional but weird/ugly code, whereas gpt-5.1 code is cleaner)
I can imagine what the vetting looks like: The professionals are not allowed to disclose that the models don't work.
EDIT: It must really hurt that ORCL is down 40% from its high due to overexposure in OpenAI.
Looking for some small feedback on my AI-coded website template
(Delete if not ok)
The models are so good, unbelievably good. And getting better weekly, including on pricing.
exacube•2h ago
"The most advanced agentic coding model for professional software engineers"
koakuma-chan•2h ago
cj•2h ago
koakuma-chan•2h ago
HarHarVeryFunny•2h ago
If you want to compare the weakest models from both companies, then Gemini Flash vs GPT Instant would seem to be the best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.
In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release OpenAI has made since announcing "Code Red" (aka "we're behind the competition") a few days ago as more marketing than anything else.
BeetleB•2h ago
koakuma-chan•2h ago
Gemini 3 Pro is a great foundation model. I use as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.
Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2
postalcoder•2h ago
Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced it with a `<rest of code goes here>`. Unacceptable behavior in 2025, lol.
walthamstow•2h ago
Tostino•45m ago
koakuma-chan•37m ago
nunodonato•2h ago
koakuma-chan•2h ago
nunodonato•2h ago
koakuma-chan•2h ago
NoveltyEngine•23m ago
dkdcio•2h ago
koakuma-chan•2h ago
dkdcio•2h ago
again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
edit: also FWIW, I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor. now I primarily use Claude Code given I found Codex slow and less “reliable” in a sense, but I try to try all 3 and keep up with the changes (it is hard)
koakuma-chan•1h ago
Such as?
> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
I am testing all models in Cursor.
> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor
I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else, because Cursor has all models, and it appears to them that it's not worth having anything else.
dkdcio•1h ago
> Such as?
changelog is here: https://github.com/anthropics/claude-code/blob/main/CHANGELO...
glhf
btw you started this thread with pure vibes, no evidence:
> I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.
I’m saying you’re wrong. N=2, 1 against 1, one of us is making a much less bold claim
koakuma-chan•1h ago
> “prompting”/harness that improves how it actually performs
Is an abstract statement without any meaningful details.
nunodonato•1h ago
speedgoose•2h ago