
Beginning January 2026, all ACM publications will be made open access

https://dl.acm.org/openaccess
965•Kerrick•5h ago•104 comments

We pwned X, Vercel, Cursor, and Discord through a supply-chain attack

https://gist.github.com/hackermondev/5e2cdc32849405fff6b46957747a2d28
255•hackermondev•2h ago•77 comments

GPT-5.2-Codex

https://openai.com/index/introducing-gpt-5-2-codex/
213•meetpateltech•2h ago•134 comments

Skills for organizations, partners, the ecosystem

https://claude.com/blog/organization-skills-and-directory
191•adocomplete•4h ago•117 comments

Texas is suing all of the big TV makers for spying on what you watch

https://www.theverge.com/news/845400/texas-tv-makers-lawsuit-samsung-sony-lg-hisense-tcl-spying
125•tortilla•2d ago•65 comments

Delty (YC X25) Is Hiring an ML Engineer

https://www.ycombinator.com/companies/delty/jobs/MDeC49o-machine-learning-engineer
1•lalitkundu•11m ago

T5Gemma 2: The next generation of encoder-decoder models

https://blog.google/technology/developers/t5gemma-2/
33•milomg•1h ago•3 comments

Classical statues were not painted horribly

https://worksinprogress.co/issue/were-classical-statues-painted-horribly/
464•bensouthwood•8h ago•231 comments

How China built its ‘Manhattan Project’ to rival the West in AI chips

https://www.japantimes.co.jp/business/2025/12/18/tech/china-west-ai-chips/
52•artninja1988•2h ago•56 comments

How did IRC ping timeouts end up in a lawsuit?

https://mjg59.dreamwidth.org/73777.html
61•dvaun•1d ago•4 comments

FunctionGemma 270M Model

https://blog.google/technology/developers/functiongemma/
76•mariobm•2h ago•24 comments

How to hack Discord, Vercel and more with one easy trick

https://kibty.town/blog/mintlify/
40•todsacerdoti•1h ago•9 comments

Show HN: Picknplace.js, an alternative to drag-and-drop

https://jgthms.com/picknplace.js/
33•bbx•2d ago•16 comments

Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn)

https://github.com/vivienhenz24/fuzzy-canary
49•misterchocolat•2d ago•10 comments

TRELLIS.2: state-of-the-art large 3D generative model (4B)

https://github.com/microsoft/TRELLIS.2
30•dvrp•1d ago•3 comments

Your job is to deliver code you have proven to work

https://simonwillison.net/2025/Dec/18/code-proven-to-work/
498•simonw•6h ago•423 comments

Meta Segment Anything Model Audio

https://ai.meta.com/samaudio/
83•megaman821•2d ago•9 comments

The most banned books in U.S. schools

https://pen.org/top-52-banned-books-since-2021/
56•FigurativeVoid•2h ago•155 comments

How I wrote JustHTML, a Python-based HTML5 parser, using coding agents

https://friendlybit.com/python/writing-justhtml-with-coding-agents/
29•simonw•4d ago•16 comments

Using TypeScript to obtain one of the rarest license plates

https://www.jack.bio/blog/licenseplate
114•lafond•6h ago•116 comments

Firefox will have an option to disable all AI features

https://mastodon.social/@firefoxwebdevs/115740500373677782
105•twapi•2h ago•114 comments

Oliver Sacks put himself into his case studies – what was the cost?

https://www.newyorker.com/magazine/2025/12/15/oliver-sacks-put-himself-into-his-case-studies-what...
6•barry-cotter•21m ago•51 comments

I've been writing ring buffers wrong all these years (2016)

https://www.snellman.net/blog/archive/2016-12-13-ring-buffers/
16•flaghacker•2d ago•2 comments

The Scottish Highlands, the Appalachians, Atlas are the same mountain range

https://vividmaps.com/central-pangean-mountains/
28•lifeisstillgood•1h ago•9 comments

Please just try HTMX

http://pleasejusttryhtmx.com/
345•iNic•6h ago•305 comments

Show HN: Spice Cayenne – SQL acceleration built on Vortex

https://spice.ai/blog/introducing-spice-cayenne-data-accelerator
21•lukekim•2h ago•2 comments

The <time> element should do something

https://nolanlawson.com/2025/12/14/the-time-element-should-actually-do-something/
35•birdculture•2d ago•7 comments

Military standard on software control levels

https://entropicthoughts.com/mil-std-882e-software-control
47•ibobev•4h ago•21 comments

Ringspace: A proposal for the human web

https://taggart-tech.com/ringspace/
14•todsacerdoti•17h ago•3 comments

Interactive Fluid Typography

https://electricmagicfactory.com/articles/interactive-fluid-typography/
14•list•1h ago•0 comments

GPT-5.2-Codex

https://openai.com/index/introducing-gpt-5-2-codex/
213•meetpateltech•2h ago

Comments

exacube•2h ago
would love to see some comparison numbers to Gemini and Claude, especially with this claim:

"The most advanced agentic coding model for professional software engineers"

koakuma-chan•2h ago
I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.
cj•2h ago
Gemini 2.5 or 3? (3 was released yesterday)
koakuma-chan•2h ago
I tried Gemini 3 Flash, and I am unimpressed. It's maybe a competitor to Cursor's Composer 1, but in a completely different league from GPT 5.2
HarHarVeryFunny•2h ago
Surely Gemini 3.0 Pro would be the appropriate comparison.

If you want to compare the weakest models from both companies, then Gemini Flash vs GPT Instant would seem to be the best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.

In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI since they announced "Code Red" (aka "we're behind the competition") a few days ago as more marketing than anything else.

BeetleB•2h ago
Gemini 3.0 Flash outperforms Pro in many tasks - I believe the coding benchmark was one of them.
koakuma-chan•2h ago
That's what I said in my original message. By my account, GPT 5.2 is better than Gemini 3 Pro and Opus 4.5

Gemini 3 Pro is a great foundation model. I use it as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.

Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2

postalcoder•2h ago
Agreed. Gemini 3 is still pretty bad at agentic coding.

Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced them with a `<rest of code goes here>` placeholder. Unacceptable behavior in 2025, lol.

walthamstow•2h ago
Glad I'm not alone in thinking Flash 3 was like Composer 1 in speed but smarter
Tostino•45m ago
3 has been out for at least a couple weeks for me.
koakuma-chan•37m ago
He meant 3 Flash, which came out recently
nunodonato•2h ago
I'm gonna call bs on this kind of comment. "Better" at what? Coding models shouldn't even be compared in isolation. A big part of making them work in a real/big codebase is the tool that calls the model (Claude Code, gemini-cli, etc). I'll bet Claude Code will still keep eating your lunch every day of the week against any competitor out there
koakuma-chan•2h ago
I haven't used CC in a few months, what killer features have they added? I am using Cursor, it's clunky, but not that clunky so as to completely destroy model performance. I am pretty sure for my tasks (undocumented, buggy, legacy JavaScript project) GPT-5.2 is > all on any decent harness, because it doesn't give up or half-ass. It can run for 5 minutes or for 50 minutes, depending on your request.
nunodonato•2h ago
it's not about features (although they've added plenty), it's the internal tooling and the way the model is prompted.
koakuma-chan•2h ago
The only thing I know of that CC has and Cursor doesn't is the ability to spawn agents. You can just prompt CC with "spawn 10 agents" and it will make 10 subagents that run concurrently. But otherwise, I don't know what CC does that Cursor doesn't. On the contrary, AFAIK, CC doesn't index your codebase, and Cursor does.
NoveltyEngine•23m ago
Surely CC has a lower price? How much do you have to pay Cursor for the equivalent of what's provided in a 20x Claude Max plan?
dkdcio•2h ago
lol, bold initial claim for someone who hasn't used the primary competitor in months. I try to use all 3 (Claude Code, Codex CLI, Gemini CLI); there are tradeoffs between all 3
koakuma-chan•2h ago
Read my reply to sibling comment. To my knowledge, Claude Code is at most marginally better than Cursor, and it's mostly the model that matters. Not saying there is no room for improvement on the tooling side, but no one seems to have come up with anything so far. Let me know which killer features Claude Code has, I would be happy to learn.
dkdcio•2h ago
it’s the “agentic harness” — they have shipped tons of great features for the DevEx, but it’s the combination of better models (Sonnet 4.5 1M, now Opus 4.5) and the “prompting”/harness that improves how it actually performs

again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch

edit: also FWIW, I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor. now I primarily use Claude Code given I found Codex slow and less “reliable” in a sense, but I try to try all 3 and keep up with the changes (it is hard)

koakuma-chan•1h ago
> they have shipped tons of great features for the DevEx

Such as?

> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch

I am testing all models in Cursor.

> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor

I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I'm only using it because my company refuses to buy anything else: Cursor has all the models, so to them it doesn't seem worth paying for anything else.

dkdcio•1h ago
you conveniently ignored the most important part of my comment :)

> Such as?

changelog is here: https://github.com/anthropics/claude-code/blob/main/CHANGELO...

glhf

btw you started this thread with pure vibes, no evidence:

> I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.

I’m saying you’re wrong. N=2, 1 against 1, one of us is making a much less bold claim

koakuma-chan•1h ago
You do not seem to be able to tell me anything substantial, i.e. specifically how Claude Code is a better harness than Cursor.

> “prompting”/harness that improves how it actually performs

Is an abstract statement without any meaningful details.

nunodonato•1h ago
we don't have the ability to see the inner workings of Claude Code, it's not open source. You just use it and you see the difference. I've tried all of them, including Antigravity. Nothing beats Claude Code
speedgoose•2h ago
It's significantly slower though. At least for my use cases, I'd rather ask Claude Opus 4.5 and switch to GPT if Claude is stuck.
whinvik•2h ago
I actually have 0 enthusiasm for this model. When GPT 5 came out it was clearly the best model, but since Opus 4.5, GPT5.x just feels so slow. So, I am going to skip all `thinking` releases from OpenAI and check them again only if they come up with something that does not rely so much on thinking.
seneca•2h ago
I hope this makes a big jump forward for them. I used to be a heavy Codex user, but it has just been so much worse than Claude Code both in UX and in actual results that I've completely given up on it. Anthropic needs a real competitor to keep them motivated and they just don't have one right now, so I'd really like to see OpenAI get back in the game.
GenerWork•2h ago
GPT 5.2 has gotten a lot better at building UI elements when given a Figma MCP server link. I used to use Claude for building brand new UI elements based on the Figma link, but 5.2 caught up to a point where I'm probably going to cancel Claude.
seneca•2h ago
Nice, I'll have to give that a shot. I often use Claude for exactly that.
NitpickLawyer•2h ago
> In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety.

Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (e.g. analyse this code snippet and provide security feedback / improvements).

At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (e.g. our gpt666-pro-ultra-krypto-sec found a CVE in an OpenBSD stable release), while not being exposed to tabloid-style titles like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...

larrymcp•2h ago
Can anyone elaborate on what they're referring to here?

> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.

I'm curious what they mean by the dual-use risks.

pixl97•2h ago
Finding/patching exploits means you also can exploit them better?
throwaway127482•2h ago
They did some interesting wordsmithing here to cover their ass without saying it directly.
whynotminot•1h ago
What they said sounded pretty direct to me.
runtimepanic•2h ago
“Dual-use” here usually isn’t about novel attack techniques, but about lowering the barrier to execution. The same improvements that help defenders reason about exploit chains, misconfigurations, or detection logic can also help an attacker automate reconnaissance, payload adaptation, or post-exploitation analysis. Historically, this shows up less as “new attacks” and more as speed and scale shifts. Things that required an experienced operator become accessible to a much wider audience. That’s why deployment controls, logging, and use-case constraints matter as much as the raw capability itself.
baq•2h ago
probably that it's good at tasks for either team color, red or blue - and if it is, it means you can automate some... interesting workflows.
tgtweak•2h ago
Good at finding/fixing security vulnerabilities = Good at finding/exploiting security vulnerabilities.
dpoloncsak•1h ago
"Please review this code for any security vulnerabilities" has two very different outcomes depending on whether it's the maintainer or a threat actor prompting the model
OldGreenYodaGPT•2h ago
GPT 5.2 has been very good in Codex; can't wait to try this new model. Will see how it compares to Opus 4.5
tptacek•2h ago
It's interesting that they're foregrounding "cyber" stuff (basically: applied software security testing) this way, but I think we've already crossed a threshold of utility for security work that doesn't require models to advance to make a dent --- and won't be responsive to "responsible use" controls. Zero-shotting is a fun stunt, but in the real world what you need is just hypothesis identification (something the last few generations of models are fine at) and then quick building of tooling.

Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.

trunnell•2h ago
Why aren’t they making gpt-5.2-codex available in the API at launch?
kingstnap•2h ago
> we’re piloting invite-only trusted access to upcoming capabilities and more permissive models

Just safety nerds being gatekeepers.

trunnell•1h ago
That’s for future unreleased capabilities and models, not the model released today.

They did the same thing for gpt-5.1-codex-max (code name “arcticfox”), delaying its availability in the API and only allowing it to be used by monthly plan users, and as an API user I found it very annoying.

MallocVoidstar•1h ago
They can't train on the API.
dist-epoch•36m ago
They say it's because it's too good at hacking stuff.
fellowniusmonk•2h ago
In all my unpublished tests, which focus on 1. unique logic puzzles that are intentionally adjacent to existing puzzles and 2. implementing a specific, uncommon CRDT algorithm that has an official reference implementation on GitHub (so the models have definitely been trained on it), I find that 5.2 overfits to the more common implementation and will actively break working code and puzzles.

I find it pattern-matches incorrectly with a very narrow focus and will ignore real, documented differences even when they're explicitly highlighted in the prompt text (this is CRDT algorithm X, not CRDT algorithm Y).

I've canceled my subscription; the idea that on any larger edit it will just start wrecking nuance and then refuse to accept prompts that point this out is an extremely dangerous form of target fixation.

pillefitz•2h ago
How does Claude perform?
fellowniusmonk•1h ago
They all have difficulty with certain CRDT types in general. Opus 4.5 has to go through a round in ask mode to get clarifying instructions, but then it's fine. Neither gets it perfectly as a one-shot; Claude, if you jump straight into agent mode, won't break code but will churn for a bit.
postalcoder•2h ago
It has very quickly become unfashionable to say you like the Codex CLI. I still enjoy working with it, and my only complaint is that its speed makes it less than ideal for pair coding.

On top of that, the Codex CLI team is responsive on github and it's clear that user complaints make their way to the team responsible for fine tuning these models.

I run bake-offs between all three models, and GPT 5.2 generally has a higher success rate at implementing features, followed closely by Opus 4.5 and then Gemini 3, which has trouble with agentic coding. I'm interested to see how 5.2-codex behaves. I haven't been a fan of the codex models in general.

jbm•2h ago
When Claude screws up a task I use Codex and vice versa. It helps a lot when I'm working on libraries that I've never touched before, especially iOS related.

(Also, I can't imagine who is blessed with so much spare time that they would look down on an assistant that does decent work)

embedding-shape•1h ago
> When Claude screws up a task I use Codex and vice versa

Yeah, it feels really strange sometimes. Bumping up against something that Codex seemingly can't work out, and you give it to Claude and suddenly it's easy. And you continue with Claude and eventually it gets stuck on something, and you try Codex which gets it immediately. My guess would be that the training data differs just enough for it to have an impact.

extr•1h ago
I think Claude is more practically minded. I find that OAI models in general default to the most technically correct, expensive (in terms of LoC implementation cost, possible future maintenance burden, etc) solution. Whereas Claude will take a look at the codebase and say "Looks like a webshit React app, why don't you just do XYZ which gets you 90% of the way there in 3 lines".

But if you want that last 10%, codex is vital.

Edit: Literally right after I typed this, it happened. Codex 5.2 reports a P1 bug in a PR. I look closely; I'm not actually sure it's a "bug". I take it to Claude. Claude agrees it's more of a product behavioral opinion on whether or not to persist garbage data, and offers its own product opinion that I probably want to keep it the way it is. Codex 5.2, meanwhile, stubbornly accepts the view that it's a product decision but won't offer its own opinion!

enraged_camel•42m ago
>> My guess would be that the training data differs just enough for it to have an impact.

It's because performance degrades over longer conversations, which decreases the chance that the same conversation will result in a solution, and increases the chance that a new one will. I suspect you would get the same result even if you didn't switch to a different model.

XenophileJKO•1m ago
So, not really; models certainly degrade to some degree on context retrieval. However, in Cursor you can just change the model used for the exchange while it still has the same long context, and you'll see the different models' strengths and weaknesses contrasted.

They just have different strengths and weaknesses.

dingnuts•2h ago
the faddish nature of these tools fits the narrative of the METR findings that the tools slow you down while making you feel faster.

since nobody (other than that paper) has been trying to measure output, everything is based on feelings and fashion, like you say.

I'm still raw dogging my code. I'll start using these tools when someone can measure the increase in output. Leadership at work is beginning to claim they can, so maybe the writing is on the wall for me. They haven't shown their methodology for what they are measuring, just telling everyone they "can tell"

But until then, I can spot too many psychological biases inherent in their use to trust my own judgement, especially when the only real study done so far on this subject shows that our intuition lies about this.

And in the meantime, I've already lost time investigating reasonable looking open source projects that turned out to be 1) vibe coded and 2) fully non functional even in the most trivial use. I'm so sick of it. I need a new career

qsort•2h ago
I care very little about fashion, whether in clothes or in computers. I've always liked Anthropic products a bit more but Codex is excellent, if that's your jam more power to you.
mccoyb•2h ago
If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities!

Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).

Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.

tgtweak•2h ago
Anecdotally I've found it very good in the exact same case for multi-agent workflows - as the "reviewer"
ifwinterco•2h ago
I think the issue is for them "quality, not speed" means "expensive, not cheap" and they can't pass that extra cost on to customers
mccoyb•2h ago
I'm happy to pay the same right now for less (on the max plan, or whatever) -- because I'm never running into limits, and I'm running these models near all day every day (as a single user working on my own personal projects).

I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?

Computer0•2h ago
I am on the $20 plan for both CC and Codex; I feel like a session of usage on CC == ~20% of Codex usage per 5 hours in terms of time spent inferencing. Codex has always seemed way more generous than I would expect.
Aurornis•35m ago
Agreed. The $20 plans can go very far when you're using the coding agent as an additional tool in your development flow, not just trying to hammer it with prompts until you get output that works.

Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly

andai•29m ago
If you look at benchmarks, the Claude models score significantly higher intelligence per token. I'm not sure how that works exactly, but they are offset from the entire rest of the chart on that metric. It seems they need fewer tokens to get the same result. (I can't speak for how that affects performance on very difficult tasks though, since most of mine are pretty straightforward.)

So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.

See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here

https://artificialanalysis.ai/

...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!

kilroy123•2h ago
Interesting. What I've seen is that it spins and thinks forever, then just breaks. Which is beyond frustrating.
mccoyb•2h ago
If by "just breaks" you mean "refuses to write code / gives up or reverts what it does" -- yes, I've experienced that.

Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.

I basically use it to drive Claude Code, which will nuke the codebase with abandon.

kilroy123•1h ago
I've seen it think for a long time and then just timeout or something? It just stops and nothing happens.
JamesSwift•12m ago
I've had the same, but I only use it through Zed, so I wasn't sure if it was a Codex issue or a Zed issue
baq•1h ago
we're all senior continue engineers nowadays it seems
apitman•2h ago
It's annoying though because it keeps (accurately) pointing out critical memory bugs that I clearly need to fix rather than pretending they aren't there. It's slowing me down.
echelon•1h ago
> If anyone from OpenAI is reading this

(unrelated, but piggybacking on requests to reach the teams)

If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.

Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.

OpenAI kicks ass at this (but could do better with style controls - if I give a Midjourney style ref, use it) :

https://imgur.com/gallery/previz-to-image-gpt-image-1-x8t1ij...

https://imgur.com/a/previz-to-image-gpt-image-1-5-3fq042U

Google fails the tests currently, but can probably easily catch up :

https://imgur.com/a/previz-to-image-nano-banana-pro-Q2B8psd

79a6ed87•2h ago
My only concern with Codex is that it's not possible to delete tasks.

This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".

Leynos•2h ago
Terragon is an alternative (hosts Claude and Codex using your OpenAI and Anthropic subscriptions, and also supports Google and Amp) that provides this functionality.

I use it because it works out cheaper than Codex Cloud and gives you greater flexibility. Although it doesn't have 5.2-codex yet.

tgtweak•2h ago
Yes but if it's not getting removed at the origin... it's not fixing the actual issue of the context/conversation surviving past an explicit "delete" request. Also let's not forget that anyone proxying LLMs is also man in the middle to any code that goes up/down.
Leynos•1h ago
79a6ed87’s comment applies to Codex cloud, not the codex CLI, which is what Terragon is using.
throwuxiytayq•2h ago
It's weird, suspicious, and plain annoying. I like the tool and my tests have shown it to be very powerful (if a bit rough and buggy), but this is ridiculous - I won't use it for any real world projects until this is fixed.

Then again, I wouldn't put much trust into OpenAI's handling of information either way.

moralestapia•2h ago
`rm -rf ~/.codex/archived_sessions` does the trick
79a6ed87•1h ago
Interesting. Where do I run that?
moralestapia•1h ago
Uhm ... I assumed you were on Linux or OS X, if that's the case just open a terminal and paste that, I swear it's not malicious code.

Unsure where that could be if you're using Windows.

You know what would be fun to try? Give Codex full access and then ask it to delete that folder, lol.

zenburnmyface•26m ago
This is A+ satire
sunaookami•1h ago
Are you talking about Codex Web? This is different from Codex CLI.
shanev•2h ago
The GPT models, in my experience, have been much better for backend than the Claude models. They're much slower, but produce logic that is clearer and code that is more maintainable. A pattern I use is: set up a GitHub issue with Claude plan mode, then have Codex execute it. Then come back to Claude to run custom code review plugins. Then, of course, review it with my own eyes before merging the PR.

My only gripe is I wish they'd publish Codex CLI updates to homebrew the same time as npm :)

mrcwinn•2h ago
I'd have agreed with you until Opus 4.5.
SamDc73•1h ago
GPT-5 was the first model that occasionally produced code that I could push without any changes

Claude still tends to add "fluff" around the solution and over-engineer; not that the code doesn't work, it's just ugly

lemming•1h ago
Interesting, I have consistently found that Codex does much better code reviews than Claude. Claude will occasionally find real issues, but will frequently bike shed things I don't care about. Codex always finds things that I do actually care about and that clearly need fixing.
k_bx•2h ago
Codex code review has been astounding for my distributed team of devs. Very well spent money.
freedomben•2h ago
The cybersecurity angle is interesting, because in my experience OpenAI stuff has gotten terrible at cybersecurity because it simply refuses to do anything that can be remotely offensive (as in the opposite of "defensive"). I really thought we as an industry had learned our lesson that blocking "good guys" (aka white-hats) from offensive tools/capabilities only empowers the gray-hat/black-hats and puts us at a disadvantage. A good defense requires some offense. I sure hope they change that.
mapontosevenths•2h ago
The article mentions that more permissive models would be invite only. I think it's a solid approach, as long as they don't make getting one of those invites too difficult.

> "In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety."

hiAndrewQuinn•2h ago
I'm moving into a cybersecurity-focused role, and I for one would be very interested in this. A vetting process makes total sense, but a complete lack of access seems like a market inefficiency in the making, given that this is the one area where we can't reliably get the frontier models to assist us in pentesting our own stuff without a lot of hedging.
hhh•2h ago
I use openai models every day for offensive work. haven’t had a problem in a long time
JacobAsmuth•2h ago
So in general you think that making frontier AI models more offensive in black hat capabilities will be good for cybersecurity?
artursapek•2h ago
Of course. Bugs only get patched if they’re found.
abigail95•2h ago
Does it shift the playing field towards bad actors in a way that other tools don't?
Uehreka•2h ago
I’m not GP, but I’d argue that “making frontier AI models more offensive in black hat capabilities” is a thing that’s going to happen whether we want it or not, since we don’t control who can train a model. So the more productive way to reason is to accept that that’s going to happen and then figure out the best thing to do.
bilbo0s•1h ago
Frontier models are good at offensive capabilities.

Scary good.

But the good ones are not open. It's not even a matter of money. I know at OpenAI they are invite only for instance. Pretty sure there's vetting and tracking going on behind those invites.

tptacek•1h ago
People in North America and Western Europe have an extremely blinkered and parochial view of how widely and effectively offensive capabilities are disseminated.
nikanj•2h ago
OpenAI is really weird about this stuff. I tried to get a good minor chord progression out of ChatGPT, but it kept running into guardrails and giving Very Serious Warnings. It felt as if there's just a dumb keyword filter in there, and any amount of verboten words will kill the entire prompt
tptacek•1h ago
That's odd, because I'm using plain-old-GPT5 as the backend model for a bunch of offensive stuff and I haven't had any hangups at all. But I'm doing a multi-agent setup where each component has a constrained view of the big picture (ie, a fuzzer agent with tool calls to drive a web fuzzer looking for a particular kind of vulnerability); the high-level orchestration is still mostly human-mediated.
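
[For readers wondering what a constrained, tool-calling component looks like in practice, here is a minimal sketch in Python using the OpenAI SDK's chat-completions tools interface. It is not tptacek's setup: the model name, the `run_fuzzer` tool, and its wiring are hypothetical placeholders. The point is only that the agent sees one narrow tool and one narrow objective, while orchestration and judgment stay with a human.]

```python
"""Minimal sketch of a narrowly scoped tool-calling agent.

The model name and the run_fuzzer helper are placeholders; swap in
whatever fuzzer or scanner you actually drive. The agent only ever
sees one tool and one narrow system prompt.
"""
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_fuzzer",
        "description": "Fuzz one endpoint parameter and summarize anomalous responses.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
                "parameter": {"type": "string"},
            },
            "required": ["url", "parameter"],
        },
    },
}]


def run_fuzzer(url: str, parameter: str) -> str:
    # Placeholder: call your real fuzzer here and summarize its output.
    return f"fuzzed {url} param={parameter}: 2 responses reflected the payload unencoded"


def fuzz_agent(target_url: str, max_turns: int = 8) -> str:
    messages = [
        {"role": "system", "content":
         "You are a web fuzzing assistant. You may only use the run_fuzzer "
         "tool. Report candidate anomalies; do not attempt exploitation."},
        {"role": "user", "content": f"Look for reflected input handling bugs on {target_url}."},
    ]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="gpt-5",  # model name is a placeholder; use whatever you have access to
            messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final report, handed back to a human for review
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_fuzzer(**args),
            })
    return "turn limit reached"
```

[The turn cap and the single-tool schema are the "constrained view": the model can iterate on hypotheses against one tool, but everything above that stays human-mediated.]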
julienfr112•1h ago
More generally, GPT is being heavily neutered: for example, I tried to make it rebuild Codex itself. It starts to answer, then deletes the code and goes "I'm not to answer that". As if building Codex inside Codex is a path to Terminator and co.
prettyblocks•29m ago
ChatGPT is very happy to help me with offensive tasks. Codex is as well.
monster_truck•2h ago
Pathetic. They got people working a week before Christmas for this?

Devstral Small 2 Instruct running locally seems about as capable, with the upside that when it's wrong it's very obvious instead of covering it in bullshit.

speedgoose•2h ago
Devstral 2 struggles with the tools syntax in my own testing. Happy to read that it works with some.
chollida1•1h ago
What should companies do with people a week before Christmas if not give them work to do?

What about 2 weeks before Christmas?

tananaev•2h ago
I was very skeptical about Codex at the beginning, but now all my coding tasks start with Codex. It's not perfect at everything, but overall it's pretty amazing: refactoring, building something new, building something I'm not familiar with. It is still not great at debugging things.

One surprising thing that Codex helped with is procrastination. I'm sure many people have had this feeling: you have some big task and you don't quite know where to start. Just send it to Codex. It might not get it right, but it's almost always a good starting point that you can quickly iterate on.

BinaryIgor•2h ago
I have similar experiences with Claude Code ;) Have you used it as well? How does it compare?
9dev•1h ago
I always wonder how people make qualitative statements like this. There are so many variables! Is it my prompt? The task? The specific model version? A good or bad branch out of the non-deterministic solution space?

Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results? Not snark by the way, I’m asking in earnest how you pick one model over another.

embedding-shape•1h ago
> Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results?

This is what I do. I have a little TUI that fires off Claude Code, Codex, Gemini, Qwen Coder and AMP in separate containers for most tasks I do (although I've started to use AMP less and less), and it returns the last message of each reply and/or a git diff of what exactly they did. Then I compare them side by side. If all of them got something wrong, I update the prompt and fire them off again. Always start from zero, and always include the full context of what you're doing in the first message; they're all non-interactive sessions.

Sometimes I do 3x Codex instead of different agents, just to double-check that all of them would do the same thing. If they go off and do different things from each other, I know the initial prompt isn't specific/strict enough, and again iterate.
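
[For anyone who wants to try this kind of bake-off themselves, here is a minimal sketch in Python of the same idea. It is not embedding-shape's actual tool: it assumes the `claude`, `codex`, and `gemini` CLIs are installed and accept the non-interactive invocations shown (`claude -p`, `codex exec`, `gemini -p`), and it uses disposable git worktrees instead of containers. Check `--help` on your installed versions and add whatever approval/sandbox flags your setup needs.]

```python
#!/usr/bin/env python3
"""Fire the same prompt at several coding agents and compare their diffs.

Assumptions: the claude, codex and gemini CLIs are on PATH and accept
the non-interactive invocations below, and the current directory is a
git repo. Each agent gets its own disposable git worktree so the runs
can't step on each other.
"""
import os
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor

AGENTS = {
    "claude": ["claude", "-p"],   # Claude Code, print (non-interactive) mode
    "codex": ["codex", "exec"],   # Codex CLI, non-interactive mode
    "gemini": ["gemini", "-p"],   # Gemini CLI, prompt mode
}


def run_agent(name: str, cmd: list[str], prompt: str, base: str) -> str:
    """Run one agent in a fresh detached worktree and return its diff."""
    workdir = os.path.join(base, name)
    subprocess.run(["git", "worktree", "add", "--detach", workdir],
                   check=True, capture_output=True)
    try:
        subprocess.run(cmd + [prompt], cwd=workdir,
                       capture_output=True, text=True, timeout=3600)
        # Stage everything so newly created files show up in the diff too.
        subprocess.run(["git", "add", "-A"], cwd=workdir, capture_output=True)
        diff = subprocess.run(["git", "diff", "--cached"], cwd=workdir,
                              capture_output=True, text=True).stdout
        return diff or "(no changes)"
    finally:
        subprocess.run(["git", "worktree", "remove", "--force", workdir],
                       capture_output=True)


def main() -> None:
    prompt = sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()
    base = tempfile.mkdtemp(prefix="bakeoff-")
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = {name: pool.submit(run_agent, name, cmd, prompt, base)
                   for name, cmd in AGENTS.items()}
    for name, fut in futures.items():
        print(f"\n===== {name} =====\n{fut.result()}")


if __name__ == "__main__":
    main()
```

[Usage: run it from inside the repo, e.g. `python bakeoff.py "fix the flaky date parsing test"`; each agent's diff prints under its own header for side-by-side comparison.]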

dotancohen•1h ago
Please share! I'd much rather help develop your solution than vibe code one of my own ))

Honestly, I'd love to try that. My Gmail username is the same as my HN username.

enraged_camel•1h ago
Last night I gave one of the flaky tests in our test suite to three different models, using the exact same prompt.

Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix despite my prompt saying “don’t write code, simply investigate.”

I run these tests regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably.

Languages: JavaScript, Elixir, Python

tmikaeld•49m ago
I have the same experience. To make it worse, there's a mile of difference between the all-too-many versions and efforts.
jackschultz•1h ago
Infinitely agree with all of this. I was skeptical, then tried Opus 4.5 and was blown away. Codex with 5.0 and 5.1 wasn't great, but 5.2 is a big improvement. I can't write code without it anymore because there's no point: given time, quality, and the right constraints, you're going to get better code.

And the same goes for procrastination, both from not knowing where to start and from getting stuck in the middle and not knowing where to go. That literally never happens anymore. You have discussions with it to do the planning and weigh different implementation options, you get to the end with a good design description, and then, what's the point of writing the code yourself when, with that design, it will write it quickly and match what was agreed?

nextaccountic•10m ago
You can code without it. Maybe you don't want to, but if you're a programmer, you can

(here I am remembering a time when I had no computer and would program data structures in OCaml with pen and paper, then go to university the next day to try them. Often they worked on the first try)

lacoolj•2h ago
lol I love how OpenAI just straight up doesn't compare their model to others on these release pages. Basically telling us they know Gemini and Opus are better but they don't want to draw attention to it
qwesr123•2h ago
Not sure why they don't compare with others, but they are actually leading on the benchmarks they published. See here (bottom) for a chart comparing to other models: https://marginlab.ai/blog/swe-bench-deep-dive/
mistercheph•1h ago
It's like Apple: they just don't want users, or anyone, to even be thinking of their competitors. The competition doesn't exist; it's not relevant.
whimsicalism•1h ago
is swe-bench saturated? or did they switch to swe-bench pro because...?
dbbk•59m ago
This was the one thing I scanned for. No comparison against Opus. See ya.
jasonthorsness•2h ago
Recently I've had the best results with Gemini; with this I'll have to go back to Codex for my next project. It takes time to get a feel for the capabilities of a model, and it's sort of tedious having new ones come out so frequently.
tonyhart7•2h ago
Very minuscule improvement. I suspect GPT 5.2 is already a coding model from the ground up, and this Codex model includes "various optimizations + tools" on top
CjHuber•1h ago
Somehow Codex for me is always way worse than the base models.

Especially in the CLI, it seems that it's so eager to start writing code that nothing can stop it, not even the best AGENTS.md.

Asking it a question or telling it to check something doesn't mean it should start editing code; it means answer the question. All models have this issue to some degree, but Codex is the worst offender for me.

w-m•1h ago
Just use the non-codex models for investigation and planning, they listen to "do not edit any files yet, just reply here in chat". And they're better at getting the bigger picture. Then you can use the -codex variant for execution of a carefully drafted plan.
embedding-shape•1h ago
> Somehow Codex for me is always way worse than the base models.

I feel the same. CodexTheModel (why have two things named the same way?!) is a good deal faster than the other models, and probably on the "fast/accuracy" scale it sits somewhere else, but most code I want to be as high quality as possible, and the base models do seem better at that than CodexTheModel.

JeremyNT•1h ago
Same experience here.

I see people gushing over these codex models but they seem worse than the big gpt models in my own actual use (i.e. I'll give the same prompt to gpt-5.1 and gpt-5.1-codex and codex will give me functional but weird/ugly code, whereas gpt-5.1 code is cleaner)

lkt•1h ago
I've been doing some reverse engineering recently and have found Gemini 3 Pro to be the best model for that, surprisingly much better than Opus 4.5. Maybe it's time to give Codex a try
mistercheph•1h ago
Gotta love only comparing the model to other OpenAI models. And just like yesterday's Gemini thread, the vibes in this thread are so astroturfed. I guess it makes sense for the frontier labs to want to win the hearts and minds of Silicon Valley.
tptacek•1h ago
Please don't post insinuations about astroturfing, shilling, brigading, foreign agents, and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data.

https://news.ycombinator.com/newsguidelines.html

mistercheph•1h ago
Sorry, didn't realize.
abshkbh•1h ago
We have made this model even better at programming on Windows. Give it a shot :)
catigula•1h ago
The models aren't smart enough to be fully agentic. This is why Claude Code's human-in-the-loop process is 100x more ergonomic.
bgwalter•1h ago
They found one React bug and spent pages on "frontier" "cyber" nonsense. They make these truly marvelous models only available to "vetted" "security professionals".

I can imagine what the vetting looks like: The professionals are not allowed to disclose that the models don't work.

EDIT: It must really hurt that ORCL is down 40% from its high due to overexposure to OpenAI.

chiengineer•32m ago
https://news.ycombinator.com/item?id=46317068

Looking for some small feedback on my AI-coded website template

(Delete if not ok)

ianberdin•24m ago
Thank gosh we have so much bloody competition.

The models are so good, unbelievably good. And getting better weekly, including pricing.