
Claude Opus 4.6

https://www.anthropic.com/news/claude-opus-4-6
804•HellsMaddy•2h ago•360 comments

GPT-5.3-Codex

https://openai.com/index/introducing-gpt-5-3-codex/
518•meetpateltech•1h ago•189 comments

Orchestrate teams of Claude Code sessions

https://code.claude.com/docs/en/agent-teams
169•davidbarker•2h ago•81 comments

Don't rent the cloud, own instead

https://blog.comma.ai/datacenter/
967•Torq_boi•14h ago•403 comments

There Will Come Soft Rains (1950) [pdf]

https://www.btboces.org/Downloads/7_There%20Will%20Come%20Soft%20Rains%20by%20Ray%20Bradbury.pdf
25•wallflower•4d ago•5 comments

Ardour 9.0 Released

https://ardour.org/whatsnew.html
96•PaulDavisThe1st•1h ago•15 comments

A small, shared skill library by builders, for builders. (human and agent)

https://github.com/PsiACE/skills
19•recrush•1h ago•0 comments

European Commission Trials Matrix to Replace Teams

https://www.euractiv.com/news/commission-trials-european-open-source-communications-software/
241•Arathorn•3h ago•126 comments

The New Collabora Office for Desktop

https://www.collaboraonline.com/collabora-office/
118•mfld•6h ago•66 comments

Advancing finance with Claude Opus 4.6

https://claude.com/blog/opus-4-6-finance
73•da_grift_shift•2h ago•12 comments

Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

https://arxiv.org/abs/2512.04124
19•toomuchtodo•1h ago•18 comments

Maihem (YC W24): hiring sr robotics perception engineer (London, on-site)

https://jobs.ashbyhq.com/maihem/8da3fa8b-5544-45de-a99e-888021519758
1•mxrns•3h ago

Flock CEO calls Deflock a "terrorist organization" [video]

https://www.youtube.com/watch?v=l-kZGrDz7PU
77•cdrnsf•1h ago•15 comments

150 MB Minimal FreeBSD Installation

https://vermaden.wordpress.com/2026/02/01/150-mb-minimal-freebsd-installation/
84•vermaden•4d ago•11 comments

Anthropic's Claude Opus 4.6 uncovers 500 zero-day flaws in open-source code

https://www.axios.com/2026/02/05/anthropic-claude-opus-46-software-hunting
88•speckx•1h ago•45 comments

We tasked Opus 4.6 using agent teams to build a C Compiler

https://www.anthropic.com/engineering/building-c-compiler
98•modeless•59m ago•82 comments

Company as Code

https://blog.42futures.com/p/company-as-code
178•ahamez•7h ago•94 comments

When internal hostnames are leaked to the clown

https://rachelbythebay.com/w/2026/02/03/badnas/
396•zdw•14h ago•212 comments

GB Renewables Map

https://renewables-map.robinhawkes.com/
104•RobinL•7h ago•37 comments

Nanobot: Ultra-Lightweight Alternative to OpenClaw

https://github.com/HKUDS/nanobot
175•ms7892•10h ago•96 comments

Fela Kuti First African to Get Grammys Lifetime Achievement Award

https://www.aljazeera.com/news/2026/2/1/fela-kuti-becomes-first-african-to-get-grammys-lifetime-a...
73•defrost•4d ago•18 comments

A Broken Heart

https://allenpike.com/2026/a-broken-heart/
130•memalign•4d ago•34 comments

Programming Patterns: The Story of the Jacquard Loom

https://www.scienceandindustrymuseum.org.uk/objects-and-stories/jacquard-loom
64•andsoitis•4d ago•26 comments

The time I didn't meet Jeffrey Epstein

https://scottaaronson.blog/?p=9534
20•pfdietz•38m ago•4 comments

CIA suddenly stops publishing, removes archives of The World Factbook

https://simonwillison.net/2026/Feb/5/the-world-factbook/
177•ck2•5h ago•57 comments

Triton Bespoke Layouts

https://www.lei.chat/posts/triton-bespoke-layouts/
6•matt_d•4d ago•0 comments

Simply Scheme: Introducing Computer Science (1999)

https://people.eecs.berkeley.edu/~bh/ss-toc2.html
86•AlexeyBrin•4d ago•27 comments

Unsealed court documents show teen addiction was big tech's "top priority"

https://techoversight.org/2026/01/25/top-report-mdl-jan-25/
195•Shamar•2h ago•101 comments

Show HN: Micropolis/SimCity Clone in Emacs Lisp

https://github.com/vkazanov/elcity
131•vkazanov•11h ago•36 comments

Making Ferrite Core Inductors at Home

https://danielmangum.com/posts/making-ferrite-core-inductors-home/
93•hasheddan•3d ago•29 comments

GPT-5.3-Codex

https://openai.com/index/introducing-gpt-5-3-codex/
517•meetpateltech•1h ago

Comments

minimaxir•1h ago
I remember when AI labs coordinated so they didn't push major announcements on the same day, to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes of each other.
crorella•1h ago
The thrill of competition
zozbot234•1h ago
They're also coordinating around Chinese New Year to compete with new releases of the major open/local models.
DonHopkins•1h ago
Year of the Pelican!
hoeoek•1h ago
simonw?
IhateAI•1h ago
A sign of the inevitable implosion!
observationist•1h ago
The labs have fully embraced the cutthroat competition; the arms race has fully shed the civilized facade of beneficent mutual cooperation.

Dirty tricks and underhanded tactics will happen - I think Demis isn't savvy in this domain, but might end up stomping out the competition on pure performance.

Elon, Sam, and Dario know how to fight ugly and do the nasty political boardroom crap. '26 is gonna be a very dramatic year, with lots of cinematic potential for the eventual AI biopics.

manquer•1h ago
>civilized facade of mutual cooperation

>Dirty tricks and underhanded tactics

As long as the tactics are legal (i.e., not corporate espionage, bribes, etc.), no-holds-barred free market competition is the best thing for the market and for consumers.

thethimble•53m ago
The consumers are getting huge wins.

Model costs continue to collapse while capability improves.

Competition is fantastic.

tedsanders•1h ago
This goes way back. When OpenAI launched GPT-4 in 2023, both Anthropic and Google lined up counter launches (Claude and Magic Wand) right before OpenAI's standard 10am launch time.
manquer•1h ago
Wouldn't that be illegal? I.e., isn't colluding like that cartel behavior?
cedws•1h ago
I wish they'd just stop pretending to care about safety. Other than a few researchers at the top, they care about safety only as long as they aren't losing ground to the competition. Game theory guarantees the AI labs will do what it takes to ensure survival. Only regulation can enforce limits; self-policing won't work when money is involved.
thethimble•54m ago
As long as China continues to blitz forward, regulation is a direct path to losing.
pixl97•34m ago
You mean all paths are direct paths to losing.
vovavili•5m ago
The last thing I would want is for excessively neurotic bureaucrats to interfere with all the mind-blowing progress we've had in the last couple of years with LLM technology.
granzymes•1h ago
I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!

The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 for GPT-5.2-codex.

GPT-5.3-codex scores 77.3.

__jl__•1h ago
Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...
granzymes•1h ago
Insane! I think this has to be the shortest-lived SOTA for any model so far. Competition is amazing.
the_duke•1h ago
I do not trust the AI benchmarks much; they often do not line up with my experience.

That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.

So very much looking forward to trying out 5.3.

NitpickLawyer•1h ago
Just some anecdata++ here but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by codex and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to) and the overall flow is worth it for me. Speed loss doesn't impact this flow that much.
kilroy123•1h ago
Personally, I have Claude do the coding. Then 5.2-high do the reviewing.
seunosewa•38m ago
Then I pass the review back to Claude Opus to implement it.
VladVladikoff•25m ago
Just curious: is this a manual process, or have you automated these steps?
StephenHerlihyy•20m ago
I don't use OpenAI too much, but I follow a similar workflow. Use Opus for design/architecture work. Move it to Sonnet for implementation and build-out. Then finally over to Gemini for review, QC, and standards checks. There is a real gain in using different models: each has its own style and way of solving the problem, just like a human team. It's kind of awesome and crazy and a bit scary all at once.
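
For what it's worth, this kind of flow is easy to script end-to-end. A minimal sketch in Python (the headless flags for the claude and codex CLIs are assumptions here; check your installed versions):

  import subprocess

  def run(cmd):
      # Run a CLI tool headlessly and return its stdout.
      return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

  # 1. Implementation pass by one model ("-p" = print/headless mode, assumed).
  run(["claude", "-p", "Implement the TODO in src/parser.py"])

  # 2. Review pass by a second model over the resulting diff.
  diff = run(["git", "diff"])
  review = run(["codex", "exec", "Review this diff, skip nits:\n" + diff])

  # 3. Feed the findings back for a fix-up pass.
  run(["claude", "-p", "Address this review feedback:\n" + review])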
jahsome•1h ago
Another day, another HN thread of "this model changes everything," followed immediately by a reply stating "actually I have the literal opposite experience and find the competitor's model is the best," repeated until it's time to start the next day's thread.
malshe•1h ago
This pretty accurately summarizes all the long discussions about AI models on HN.
wasmainiac•1h ago
All anonymous as well. Who's making these claims? Script kiddies? Sr devs? Altman?
BoredPositron•1h ago
Given his ramblings on Twitter and the company blog, I bet he's a shitposter here.
locknitpicker•1h ago
> All anonymous as well. Who's making these claims? Script kiddies? Sr devs? Altman?

You can take off your tinfoil hat. The same models can perform differently depending on the programming language, the frameworks and libraries employed, and even the project. Also, context matters, and a model's output varies greatly depending on your prompt history.

nocman•59m ago
> Who's making these claims? Script kiddies? Sr devs? Altman?

AI agents, perhaps? :-D

clhodapp•36m ago
And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...
cactusplant7374•35m ago
Hourly occurrence on /r/codex. Model astrology is about the vibes.
StephenHerlihyy•17m ago
What amazes me the most is the speed at which things are advancing. Go back a year, or even a year before that, and all these incremental improvements have compounded. Things that used to require real effort to solve consistently, whether with RAG or context/prompt engineering, have become trivial. I totally agree with your point that each step along the way doesn't necessarily change that much. But in the aggregate it's sort of insane how fast everything is moving.
SatvikBeri•7m ago
I use Claude Code every day, and I'm not certain I could tell the difference between Opus 4.5 and Opus 4.0 if you gave me a blind test
fooker•1h ago
Yeah, these benchmarks are bogus.

Every new model overfits to the latest overhyped benchmark.

Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.

mrandish•32m ago
> Yeah, these benchmarks are bogus.

It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.

fooker•19m ago
For the current state of AI, the harness is unfortunately part of the secret sauce.
aurareturn•1h ago
5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.

I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.

Looking forward to trying 5.3.

koakuma-chan•1h ago
Opus 4.5 is more creative and better at making UIs
nerdsniper•27m ago
Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because several "99% of the time, assumption X is correct" assumptions are reversed in my project. I think Opus does better at not falling into those traps. Excited to try out 5.3.
nurettin•1h ago
Opus was quite useless today. It created lots of globals, statics, forward declarations, and hidden implementations in cpp files with no testable interface, erased types, and cast void pointers; I had to fix quite a lot and decouple the entangled mess.

Hopefully performance will pick up after the rollout.

leumon•48m ago
They tested it at xhigh reasoning, though, which is probably double the cost of Anthropic's model.

Cost to run the Artificial Analysis Intelligence Index:

  | Model                       | Cost  |
  |-----------------------------|-------|
  | GPT-5.2 Codex (xhigh)       | $3244 |
  | Claude Opus 4.5 (reasoning) | $1485 |

(and probably similar values for the newer models?)

redox99•34m ago
With the $20 GPT plan you can use xhigh, no problem. With the $20 Claude plan you hit the 5-hour limit on a single feature.
wilg•36m ago
In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.
dudeinhawaii•10m ago
I think for many/most programmers, 'speed + output' on webdev == "great coding".

Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.

But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.

Solution, use both as needed!

jronak•33m ago
Did you look at ARC-AGI-2? Codex might be overfit to Terminal-Bench.
tedsanders•26m ago
ARC-AGI-2 has a training set that model providers can choose to train on, so I really wouldn't recommend using it as a general measure of coding ability.
fishpham•1h ago
Model card: https://cdn.openai.com/pdf/23eca107-a9b1-4d2c-b156-7deb4fbc6...
raincole•1h ago
Almost like Anthropic and OpenAI are trying to front run each other
kingstnap•1h ago
That was fast!

I really do wonder what the chain of events was here. Did Sam see the Opus announcement and DM someone a minute later?

Mond_•1h ago
OpenAI has a whole history of trying to scoop other providers. This was a whole thing for Google launches, where OpenAI regularly launched something just before Google to grab the media attention.
rsanek•1h ago
Some recent examples:

GPT-4o vs. Google I/O (May 2024): OpenAI scheduled its "Spring Update" exactly 24 hours before Google’s biggest event of the year, Google I/O. They launched GPT-4o voice mode.

Sora vs. Gemini 1.5 Pro (Feb 2024): Just two hours after Google announced its breakthrough Gemini 1.5 Pro model, Sam Altman tweeted the reveal of Sora (text-to-video).

ChatGPT Enterprise vs. Google Cloud Next (Aug 2023): As Google began its major conference focused on selling AI to businesses, OpenAI announced ChatGPT Enterprise.

maxpert•1h ago
"Tell me that you are hurt without telling me that you are hurt": this applies to Sam right now.
NewsaHackO•16m ago
I assume some sort of corporate espionage. This is high stakes after all
maheshrijal•1h ago
It seems fast!
simianwords•1h ago
Any notes on pricing?
Tiberium•1h ago
It's not in the API yet ("We are working to safely enable API access soon."), but I assume the rate limits won't be worse than for 5.2 Codex.
nine_k•1h ago
Ah, "It's ready, but not yet".
yunyu•1h ago
You can just use it outside of the API?
kingstnap•1h ago
> GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership.

This is hilarious lol

uh_uh•1h ago
How so?
Philpax•1h ago
They're on shaky ground right now https://arstechnica.com/information-technology/2026/02/five-...
kingstnap•1h ago
It's kind of a suck-up that more or less confirms the beef stories that were floating around this past week.

In case you missed it. For example:

Nvidia's $100 billion OpenAI deal has seemingly vanished - Ars Technica

https://arstechnica.com/information-technology/2026/02/five-...

Specifically this paragraph is what I find hilarious.

> According to the report, the issue became apparent in OpenAI’s Codex, an AI code-generation tool. OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.

esafak•1h ago
> OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.

They should design their own hardware, then. Somehow the other companies seem to be able to produce fast-enough models.

dajonker•9m ago
There was never a $100 billion deal. Only a letter of intent which doesn't mean anything contractually.
shibeprime•1h ago
I know we just got a reset and a 2× bump with the native app release, but shipping 5.3 with no reset feels mismatched. If I’d known this was coming, I wouldn’t have used up the quota on the previous model.
trilogic•1h ago
When two multi-billion-dollar giants advertise on the same day, it is not competition but rather a sign of struggle and survival. With all the power of the "best artificial intelligence" at your disposal, plus a lot of capital and all the brilliant minds, THIS IS WHAT YOU COULD COME UP WITH?

Interesting

rishabhaiover•1h ago
What happened to you?
raincole•1h ago
AI fried brains, unfortunately.
wasmainiac•1h ago
I mean, he has a point; it's just not very eloquently written.
trilogic•1h ago
I empathize with the situation, no elegance from them, no eloquence from me :)
sdf2erf•1h ago
Yeah they are both fighting for survival. No surprise really.

Need to keep the hype going if they are both IPO'ing later this year.

superze•1h ago
How many IPOs can a company really do?
re-thc•50m ago
As many as they want. They can "spin off" and then "merge" again.
thethimble•49m ago
The AI market is an infinite sum market.

Consider the fact that 7-year-old TPUs are still sitting at near 100% utilization today.

lossolo•1h ago
What's funny is that most of this "progress" is new datasets + post-training shaping the model's behavior (instruction + preference tuning). There is no moat besides that.
Davidzheng•1h ago
"post-training shaping the models behavior" it seems from your wording that you find it not that dramatic. I rather find the fact that RL on novel environments providing steady improvements after base-model an incredibly bullish signal on future AI improvements. I also believe that the capability increase are transferring to other domains (or at least covers enough domains) that it represents a real rise in intelligence in the human sense (when measured in capabilities - not necessarily innate learning ability)
WarmWash•1h ago
>There is no moat besides that.

Compute.

Google didn't announce $185 billion in capex to do cataloguing and flash cards.

causalmodels•56m ago
Google didn't buy 30% of Anthropic to starve them of compute
WarmWash•45m ago
Probably why it's selling them TPUs.
edem•1h ago
So can I use this from Opencode? Because Anthropic started to enforce their TOS to kill the Opencode integration
rs_rs_rs_rs_rs•1h ago
You can use Anthropic models in Opencode: make an API key and you're good to go (you can even use the in-house Opencode router, Zen).

What you can't do is pretend Opencode is Claude Code to make use of that specific Claude Code subscription.

tfehring•1h ago
OpenAI models in general, yes - `opencode auth login`, select OpenAI, then ChatGPT Pro/Plus. I just checked and 5.3-codex isn't available in opencode yet, but I assume it will be soon.
InsideOutSanta•1h ago
Yes, OpenAI said they'd allow usage of their subscriptions in opencode.
regularfry•1h ago
I've tried Opus 4.5 in opencode via the GitHub Copilot API, mostly to see if it works at all. I don't think that broke any terms of service? But I also haven't checked how much more expensive I made it for myself over just calling them directly.
Robin_f•1h ago
Anthropic mostly had an advantage in speed. It feels like, with a 25% speed increase in Codex 5.3, they are now losing that advantage as well.
smith7018•39m ago
I just asked Opus 4.6 to debug a bug in my current changes and it went for 20 minutes before I interrupted it. Take that as you will.
bgirard•9m ago
Doesn't feel like a useful data point without more context. For some hard bugs I'd be thrilled to wait 30 minutes for a fix; for a trivial CSS fix, not so much. I've spent weeks+ of my career fixing single bugs. Context is everything.
maxpert•1h ago
Is it just me, or is Sam being the absolute sore loser he is and trying to steal Opus's thunder?
OutOfHere•1h ago
They both are doing this to each other.

BTW, loser is spelled with a single o.

wahnfrieden•1h ago
You could also claim that Anthropic is trying to scoop OpenAI by launching minutes earlier, as OpenAI has done with Google in the past.

For the downvoters: you must be naive to think these companies are not surveilling each other through various means.

nickthegreek•1h ago
Why is it loser? He very well could be a sore winner here.
koakuma-chan•1h ago
OpenAI is still the only AI company that has structured outputs. Anthropic now supports JSON schema, but you can't specify array length.
wahnfrieden•9m ago
Can you elaborate on what you mean? OAI structured outputs means JSON schema, doesn't it? So are you just saying they both support JSON schema, but Anthropic has a limitation?
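
For reference, the array-length constraint under discussion is JSON Schema's minItems/maxItems. A sketch of the schema shape (these are standard JSON Schema keywords; whether a given provider's strict structured-outputs mode enforces them is exactly the limitation being described):

  # Illustrative schema for a structured-outputs request.
  schema = {
      "type": "object",
      "properties": {
          "tags": {
              "type": "array",
              "items": {"type": "string"},
              "minItems": 3,  # standard JSON Schema keywords; strict-mode
              "maxItems": 5,  # support for these varies by provider
          },
      },
      "required": ["tags"],
  }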
binsquare•1h ago
At first try it solved a problem that 5.2 couldn't previously.

Seems to be slower/thinks longer.

OutOfHere•1h ago
It is absurd to release 5.3-Codex before first releasing 5.3.

Also, there is no reason for OpenAI and Anthropic to be trying to one-up each other's releases on the same day. It is hell for the reader.

apetresc•1h ago
Why is it absurd?
tomashubelbauer•1h ago
I agree; I was confused about where 5.3 non-Codex was. 5.2-Codex disappointed me enough that I won't be giving 5.3-Codex a try, but I'm looking forward to trying 5.3 non-Codex with Pi.
sunaookami•5m ago
The GPT-5.x models in general are very disappointing; the only good chat model was GPT-5 in its first week, before they made "the personality warmer", and Codex in general was always kind of meh.
aurareturn•1h ago
Because Claude Code is stealing their thunder, OpenAI is focusing on coding now.
whizzter•1h ago
Yeah, Claude Code is what everyone is talking about these days, and since OpenAI has always been the spending driver, being second or third fiddle just isn't acceptable if they're going to justify that spending.
modeless•1h ago
It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.
input_sh•1h ago
It's better on a benchmark I've never heard of!? That is groundbreaking, I'm switching immediately!
modeless•1h ago
I also wasn't that familiar with it, but the Opus 4.6 announcement leaned pretty heavily on the Terminal-Bench 2.0 score to quantify how much of an improvement it was for coding, so it looks pretty bad for Anthropic that OpenAI beat them so soundly on that specific benchmark.

Looking at the Opus model card, I see that they also have by far the highest score for a single model on ARC-AGI-2. I wonder why they didn't advertise that.

input_sh•1h ago
No way! Must be a coinkydink, no way OpenAI knew ahead of time that Anthropic was gonna put a focus on that specific useless benchmark as opposed to all the other useless benchmarks!?

I'm firing 10 people now instead of 5!

rsanek•1h ago
I usually wait to see what ArtificialAnalysis says for a direct comparison.
alexhans•1h ago
Isn't the best eval the one you build yourself, for your own use cases and value production?

I encourage people to try. You can even timebox it and come up with some simple checks that might initially look insufficient, but that discomfort is actually a sign that there's something there. Very similar to going from not having unit/integration tests for design or regression to having them.
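
One way to timebox it: a handful of prompts from your own work, each with a cheap pass/fail check, run against whichever model you're evaluating. A minimal sketch (the OpenAI SDK call is illustrative and the model name hypothetical; swap in your own provider and cases):

  from openai import OpenAI  # assumption: OpenAI SDK; any provider's client works

  client = OpenAI()

  # Each case pairs a prompt from your real work with a cheap pass/fail check.
  CASES = [
      ("Reverse a string in Python, one line only.",
       lambda out: "[::-1]" in out),
      ("What does HTTP status 409 mean? One sentence.",
       lambda out: "conflict" in out.lower()),
  ]

  def run_eval(model):
      passed = 0
      for prompt, check in CASES:
          resp = client.chat.completions.create(
              model=model, messages=[{"role": "user", "content": prompt}]
          )
          passed += bool(check(resp.choices[0].message.content or ""))
      print(f"{model}: {passed}/{len(CASES)} passed")

  run_eval("gpt-5.2")  # hypothetical model name; use whatever you have access to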

xiphias2•1h ago
"GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework, and the first we've directly trained to identify software vulnerabilities. While we don't have definitive evidence it can automate cyber attacks end-to-end, we're taking a precautionary approach and deploying our most comprehensive cybersecurity safety stack to date. Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence."

While I love Codex and believe it's an amazing tool, I believe their preparedness framework is out of date. As it becomes more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come from having more and more security-critical software vibe coded.

It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and get better on that scale.

In simpler terms: Codex should write secure software by default.

mrkeen•1h ago
Is "high-capability" a stronger or weaker claim than "team of phd-level experts"?

https://www.nbcnews.com/tech/tech-news/openai-releases-chatg...

trcf23•56m ago
That's just classic OpenAI trying to make us believe they're closing in on AGI... like all the "so-called" research from them and Anthropic about safety alignment, claiming their tech is so incredibly powerful that guardrails must be put on it.
ActionHank•29m ago
I heard the other day that every time someone claps, another vibe-coded project embeds its API keys in the webpage.

I wonder if this will continue to be the case.

da_grift_shift•16m ago
>Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence.

"We added some more ACLs and updated our regex"

mannanj•1h ago
Stolen from the Opus 4.6 thread:

GPT-5.3-Codex was so good it became my wife!

GenerWork•1h ago
I find it very, very interesting how they demoed visuals in the form of the “soft SaaS” website and mentioned how it can do user research. Codex has usually lagged behind Claude and Gemini when it comes to UX, so I’m curious to see if 5.3 will take the lead in real world use. Perhaps it’ll be available in Figma Make now?
brokencode•1h ago
I’m hoping they add better IDE integration to track active file and selection. That’s the biggest annoyance I have in working with Codex.
imasliev•1h ago
GPT-5.2-Codex was great in terms of price/value; I hope 5.3 won't ruin the race with Claude.
I_am_tiberius•1h ago
I'd like to know whether, and how much, customer prompts are illegally used for training.
renewiltord•1h ago
Oh yeah that’s in the “These Are The Illegal Things We Did” section 7.4 in the Model Card.
xlbuttplug2•47m ago
"But we anonymize prompts before training!"

Meanwhile the prompt: Crop this photo of my passport

wahnfrieden•1h ago
The pelican seems much worse than the Opus 4.6 one (though the bicycle is more accurate):

https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

morleytj•1h ago
The behind the scenes on deciding when to release these models has got to be pretty insanely stressful if they're coming out within 30 minutes-ish of each other.
Havoc•1h ago
It's also functionally unlikely without some sort of insider knowledge or coordination.
morleytj•1h ago
Could be. It could also be a situation where things are lined up to launch in the near future and a mad dash happens upon receiving outside news of another launch.

I suppose coincidences happen too, but that just seems too unlikely to believe, honestly. Some sort of knowledge leakage does seem like the most likely reason.

meisel•57m ago
I wonder if their "5.3" was continuously being updated, with regenerated benchmarks after each improvement, and they just stayed ready to release it whenever Claude released.
heraldgeezer•1h ago
Anthropic and GPT: two new models at once?
tosh•1h ago
Terminal Bench 2.0

  | Name                | Score |
  |---------------------|-------|
  | OpenAI Codex 5.3    | 77.3  |
  | Anthropic Opus 4.6  | 65.4  |
greenfish6•1h ago
Yea, but I feel like we are over the hill on benchmaxxing; many times a model has beaten Anthropic on a specific bench, but the 'feel' is that it is still not as good at coding.
AstroBen•1h ago
'feel' is no more accurate

not saying there's a better way but both suck

thethimble•51m ago
Speak for yourself. I've been insanely productive with Codex 5.2.

With the right scaffolding these models are able to perform serious work at high quality levels.

AstroBen•50m ago
..huh?
helloplanets•33m ago
He wasn't saying that both of the models suck, but that the heuristics for measuring model capability suck
crorella•50m ago
The variety of tasks they can do and will be asked to do is too wide and dissimilar; it will be very hard to have a transversal measurement. At most we will have area-specific consensus that model X or Y is better. It is like saying one person is the best coder at everything: that does not exist.
pixl97•29m ago
Yea, we're going to need benchmarks that incorporate a series of development steps for a particular language and measure how good each model is at them.

Like: can the model take your plan and ask the right questions where there appear to be holes?

How much of the architecture and system design around your language does it understand?

How does it choose to use algorithms available in the language or common libraries?

How often does it hallucinate features/libraries that aren't there?

How does it perform as context gets larger?

And that's for one particular language.

tavavex•33m ago
The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, it's a lot more informative than a simple benchmark, because it can shift as people individually discover the strong and weak points of what they're using and get better at it.
forrestthewoods•27m ago
At the end of the day, "feel" is what people rely on to pick which tool they use.

Is "feel" unscientific and broken? Sure, maybe, why not.

But at the end of the day I'm going to choose what I see with my own two eyes over a number in a table.

Benchmarks are sometimes useful too. But we are in prime Goodhart's Law territory.

AstroBen•23m ago
yeah, to be honest it probably doesn't matter too much. I think the major models are very close in capabilities
falloutx•34m ago
When Anthropic beats benchmarks it's somehow earned; when OpenAI games them, it's somehow about not feeling good at coding.
karmasimida•10m ago
Your feeling is not my feeling; Codex is unambiguously the smarter model for me.
jdthedisciple•1h ago
Gotta love how the game demo's page title is "threejs" – I guess the point was to demo its vibe-coding abilities anyway, but yea..
prng2021•1h ago
Did they post the knowledge cutoff date somewhere?
gwd•1h ago
gpt-5.3-codex isn't available on the API yet. From TFA:

> We are working to safely enable API access soon.

ecshafer•1h ago
Funny that this and Opus 4.6 were released within minutes of each other, each showing similar score improvements, each claiming to be revolutionary.
__mharrison__•1h ago
I never really used Codex (found it too slow), just 5.2, which is going to be an excellent model for my work. This looks like another step up.

This week I'm all local though, playing with opencode and running Qwen3 Coder Next on my little Spark machine. With the way these local models are progressing, I might move all my LLM work local.

andix•43m ago
I think Codex got much faster for smaller tasks in the last few months, especially if you turn thinking down to medium.
foft•1h ago
Having used Codex a fair bit, I find it really struggles with … almost anything. However, using the equivalent ChatGPT model is fantastic. I guess it's a matter of focus and of being given a smaller set of code to tackle.
ffitch•1h ago
> our team was blown away by how much Codex was able to accelerate its own development

they forgot to add “Can’t wait to see what you do with it”

roya51788•1h ago
What are the benchmarks against Opus 4.6?
itay-maman•1h ago
Something that caught my eye from the announcement:

> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training

I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.

aurareturn•1h ago
More importantly, these are the early steps of a model improving itself.

Do we still think we'll have a soft takeoff?

reducesuffering•1h ago
This has already been going on for years; it's just that they were using GPT-4.5 to work on GPT-5. All this announcement means is that they're confident enough in early GPT-5.3 output to further refine GPT-5.3 with it. But yes, takeoff will still happen because this recursive self-improvement works; it's just that we're already past the inception point.
mirsadm•31m ago
I can't tell if this is a serious conversation anymore.
quinncom•46m ago
Exponential growth may look like a very slow increase at first, but it's still exponential growth.
thrance•23m ago
I think the limiting factor is capital, not code. And I doubt GPT-X is any more competent at raising funds than the other, fleshy, snake-oilers...
mrandish•18m ago
> Do we still think we'll have soft take off?

There's still no evidence we'll have any takeoff, at least in the "Foom!" sense of LLMs independently and iteratively improving themselves to substantial new levels, reliably sustained over many generations.

To be clear, I think LLMs are valuable and will continue to improve significantly. But self-sustaining runaway positive feedback loops delivering exponential improvements and leaps in tangible, real-world utility are a substantially different hypothesis. All the impressive and rapid achievements in LLMs to date can be true while major elements required for a Foom-ish exponential takeoff are still missing.

aaaalone•17m ago
I'm only saying no to stay optimistic, to be honest.

It feels crazy to say we might see a fundamental shift within five years.

But the current additions to compute, research, etc. definitely point in this direction, I think.

ponyous•1h ago
I think the models are smart enough for most stuff; these little incremental changes barely matter now. What I want is a model that is fast.
hubraumhugo•1h ago
Anybody else not seeing it available in Codex app or CLI yet (with Plus)?
haneul•58m ago
My Codex CLI didn't notice a version bump was available, but I manually ran pnpm add -g @openai/codex and 5.3 was there afterwards.
kopollo•1h ago
Where is the google?
hsaliak•13m ago
gemini-3-flash-preview will be GA soon i hope. /s
rustyhancock•58m ago
Anyone remember the dot-com era, when one provider would claim the most miles of fibre and later that week another would take the title?
dawidg81•58m ago
May AI not write the code for me.

May I at least understand what it has "written". AI help is good, but it shouldn't replace real programmers completely. I'm done copy-pasting code I don't understand. What if one day the AI falls down and there are no real programmers left to write the software? AI as a helper is good, but I don't want AI to write whole files of my project. Then something may break and I won't know what's broken. I've experienced it many times already: I told the AI to write something for me, and the code didn't work at all. It compiled normally, but the program was bugged. Or when I was building a bigger project with ChatGPT only: it mostly worked, but as I kept prompting more and more things, everything eventually broke.

katspaugh•56m ago
Honest question: have you tried evolving your code architecture when adding features, instead of just "prompting more and more things"?
pixl97•24m ago
> What if one day AI will fall down and there will be no real programmers to write the software.

What if you want to write something very complex now that most people don't understand? You keep offering more money until someone takes the time to learn it and accomplish it, or you give up.

I mean, there are still people that hammer out horseshoes over a hot fire. You can get anything you're willing to pay money for.

bg24•50m ago
I am on a Max subscription for Claude, and I hate the fact that OpenAI hasn't figured out that $20 => $200 is a big jump. Good luck to them. As for the model: just last night, Codex 5.2 solved a problem for me that other models were going round and round on, with almost the same instructions. That said, I still plan to stay on the $100 Claude plan (overall value across many tasks, ability to create docs, co-work), and I may bump up my OpenAI subscription to the next tier should they decide to introduce one. Not going to $200 even with 5.3, unless my company pays for it.
andix•45m ago
I guess the jump is on purpose. You can buy Codex credits and also use Codex via the API (manual switching required).
bryanhogan•48m ago
The most important question: Can it do Svelte now?
speedgoose•32m ago
Today is the best day to rewrite everything in React. You may not enjoy React, but AI agents do. And they are the ones writing the code.
davidmurdoch•14m ago
5.2 was already very good with Svelte 5, at least when you have the Svelte MCP server set up.
davidmurdoch•47m ago
I've been using 5.2 the way they're describing the new use case for 5.3 this whole time.
tyfon•47m ago
I'm having a hard time parsing the OpenAI website.

Does anyone know if it is possible to use this model with opencode on the Plus subscription?

karmasimida•22m ago
For those who care:

GPT-5.3-Codex dominates terminal coding with a roughly 12-point lead (Terminal-Bench 2.0), while Opus 4.6 retains an 8-point edge in general computer use (OSWorld).

Does anyone know the difference between OSWorld and OSWorld Verified?

bgirard•18m ago
> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.

I wish they would share the full conversations, token counts, and more. I'd like a better sense of how they normalize these comparisons across versions. Is this a 3-prompt, 10M-token game? A 30-prompt, 100M-token game? Are both models using similar prompts/token counts?

I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.

[1] https://factory-gpt.vercel.app/

veb•13m ago
I just wanted to say that's a pretty cool demo! I hadn't realised people were using it for things like this.
bgirard•4m ago
Thank you. There's a demo game to get the full feel of it quickly. There are also 2D-ASCII and 3D renders you can hotswap between. The 3D models are generated with Meshy. The entire game is 'AI slop': I intentionally did no code reviews to see where that would get me. Some prompts were very specific, but others were just 'add a research of your choice'.

This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.

nananana9•17m ago
I've been listening to the insane 100x productivity gains you all are getting with AI, and "this new crazy model is a real game changer," for a few years now. I think it's about time I asked:

Can you guys point me to a single useful, majority-LLM-written, preferably reliable program that solves a non-trivial problem that hasn't already been solved a bunch of times in publicly available code?

eviks•14m ago
Great question, here is the link from the future:
beernet•13m ago
Can you point me to a human-written program an LLM cannot write? And no, just answering with a massively large codebase does not count, because this issue is temporary.

Some people just hate progress.

PieUser•13m ago
How'd they both release at the same time? Insiders?
Rperry2174•6m ago
What's interesting to me is that GPT-5.3 and Opus 4.6 are diverging philosophically, and really in the same way that actual engineers and orgs have diverged philosophically.

With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

That feels like a reflection of a real split in how people think LLM-based coding should work:

some want tight human-in-the-loop control, and others want to delegate whole chunks of work and review the result.

I'm interested to see whether models eventually optimize for those two philosophies, and for the 3rd, 4th, and 5th philosophies that will emerge in the coming years.

Maybe it will be less about benchmarks and more about different ideas of what working with AI means.