GPT-5.4

https://openai.com/index/introducing-gpt-5-4/

220•mudkipdev•1h ago

https://openai.com/index/gpt-5-4-thinking-system-card/

https://x.com/OpenAI/status/2029620619743219811

Comments

ignorantguy•2h ago

it shows a 404 as of now.

minimaxir•1h ago

Up now.

The OP has frequently gotten the scoop for new LLM releases and I am curious what their pipeline is.

Leynos•1h ago

Guess the URL and post at 10 AM PST on the day of release.

bdangubic•1h ago

curl the URL https://openai.com/index/introducing-gpt-5-? until you get 200

mudkipdev•1h ago

Probably refresh the api models list every couple minutes instead. No one could have guessed the name of GPT-Codex-Spark

mattas•1h ago

"GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event."

They show an example of 5.4 clicking around in Gmail to send an email.

I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.

TheAceOfHearts•1h ago

I think the desire is that in the long-term AI should be able to use any human-made application to accomplish equivalent tasks. This email demo is proof that this capability is a high priority.

spongebobstoes•1h ago

not everything has an API, or API use is limited. some UIs are more feature complete than their APIs

some sites try to block programmatic use

UI use can be recorded and audited by a non-technical person

Jacques2Marais•1h ago

I guess a big chunk of their target market won't know how to use APIs.

satvikpendem•1h ago

The ideal of REST, the HTML and UI is the API.

PaulHoule•1h ago

APIs have never been a gift but rather have always been a take-away that lets you do less than you can with the web interface. It’s always been about drinking through a straw, paying NASA prices, and being limited in everything you can do.

But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers…. Until LLMs came along, and changed the perceived economics and created a permission structure. [1]

AI is a threat to the “enshittification economy” because it lets us route around it.

[1] that high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site changing anything substantial about it is likely to unrecoverably tank their Google rankings so they won’t. A.I. might change the mechanics of that now that you Google traffic is likely to go to zero no matter what you do.

disqard•1h ago

> AI is a threat to the “enshittification economy” because it lets us route around it.

This is prescient -- I wonder if the Big Tech entities see it this way. Maybe, even if they do, they're 100% committed to speedrunning the current late-stage-cap wave, and therefore unable to do anything about it.

PaulHoule•26m ago

They are not a single thing.

Google has a good model in the form of Gemini and they might figure they can win the AI race and if the web dies, the web dies. YouTube will still stick around.

Facebook is not going to win the AI race with low I.Q. Llama but Zuck believed their business was cooked around the time it became a real business because their users would eventually age out and get tired of it. If I was him I'd be investing in anything that isn't cybernetic let it be gold bars or MMA studios.

Microsoft? They bought Activision for $69 billion. I just can't explain their behavior rationally but they could do worse than their strategy of "put ChatGPT in front of laggards and hope that some of them rise to the challenge and become slop producers."

Amazon is really a bricks-and-mortar play which has the freedom to invest in bricks-and-mortar because investors don't think they are a bricks-and-mortar play.

Netflix? They're cooked as is all of Hollywood. Hollywood's gatekeeping-industrial strategy of producing as few franchise as possible will crack someday and our media market may wind up looking more like Japan, where somebody can write a low-rent light novel like

https://en.wikipedia.org/wiki/Backstabbed_in_a_Backwater_Dun...

and J.C. Staff makes a terrible anime that convinces 20k Otaku to drop $150 on the light novels and another $150 on the manga (sorry, no way you can make a balanced game based on that premise!) and the cost structure is such that it is profitable.

lostmsu•1h ago

> AI is a threat to the “enshittification economy” because it lets us route around it.

I am not sure about that. We techies avoid enshittification because we recognize shit. Normies will just get their syncopatic enshittified AI that will tell them to continue buying into walled gardens.

steve1977•1h ago

One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well. Why not use machine code?

embedding-shape•1h ago

Why would human language by the wrong interface when they're literally language models? Why would machine code be better when there is probably magnitude less of training material with machine code?

You can also test this yourself easily, fire up two agents, ask one to use PL meant for humans, and one to write straight up machine code (or assembly even), and see which results you like best.

BoredPositron•1h ago

because they are inherently text based as is code?

steve1977•1h ago

But they are abstractions made to cater to human weaknesses.

jstummbillig•1h ago

Because the web and software more generally if full of not APIs and you do, in fact, need the clicking to work to make agents work generally

modeless•1h ago

A world where AIs use APIs instead of UIs to do everything is a world where us humans will soon be helpless, as we'll have to ask the AIs to do everything for us and will have limited ability to observe and understand their work. I prefer that the AIs continue to use human-accessible tools, even if that's less efficient for them. As the price of intelligence trends toward zero, efficiency becomes relatively less important.

npilk•1h ago

It feels like building humanoid robots so they can use tools built for human hands. Not clear if it will pay off, but if it does then you get a bunch of flexibility across any task "for free".

Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.

coffeemug•1h ago

A model that gets good at computer use can be plugged in anywhere you have a human. A model that gets good at API use cannot. From the standpoint of diffusion into the economy/labor market, computer use is much higher value.

f0e4c2f7•58m ago

Lots of services have no desire to ever expose an API. This approach lets you step right over that.

If an API is exposed you can just have the LLM write something against that.

denysvitali•1h ago

Article: https://openai.com/index/introducing-gpt-5-4/

gpt-5.4

Input: $2.50 /M tokens

Cached: $0.25 /M tokens

Output: $15 /M tokens

---

gpt-5.4-pro

Input: $30 /M tokens

Output: $180 /M tokens

Wtf

elliotbnvl•1h ago

Looks like it's an order of magnitude off. Missprint?

GenerWork•1h ago

Looks like an extra zero was added?

benlivengood•1h ago

Government pricing :)

outside2344•8m ago

$30 per kill approval

glerk•1h ago

Looks like fair price discovery :)

dpoloncsak•1h ago

>" GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities"

That's just not how pricing is supposed to work...? Especially for a 'non-profit'. You're charging me more so I know I have the better model?

elicash•1h ago

Can't you continue to use to older model, if you prefer the pricing?

But they also claim this new model uses fewer tokens, so it still might ultimately be cheaper even if per token cost is higher.

dpoloncsak•1h ago

I'm not against the pricing, just seems uncommon to frame it in the way they did, as opposed to the usual 'assume the customer expects more performance will cost more'

I guess they have to sell to investors that the price to operate is going down, while still needing more from the user to be sustainable

jbellis•19m ago

You can, until they turn it off.

Anthropic is pulling the plug on Haiku 3 in a couple months, and they haven't released anything in that price range to replace it.

FergusArgyll•1h ago

Maybe it's finally a bigger pretrain?

dpoloncsak•1h ago

I feel like that would have been highlighted then. "As this is a bigger pretrain, we have to raise prices".

They're framing it pretty directly "We want you to think bigger cost means better model"

minimaxir•1h ago

The marquee feature is obviously the 1M context window, compared to the ~200k other models support with maybe an extra cost for generations beyond >200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/

Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.

I am skeptical whether the 1M context window will provide material gains as current Codex/Opus show weaknesses as its context window is mostly full, but we'll see.

Per updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supercedes GPT-5.3-Codex, which is an interesting move.

thehamkercat•1h ago

GPT 5.3 codex had 400K context window btw

simianwords•1h ago

Why would some one use codex instead?

embedding-shape•1h ago

Why would someone use Claude Code instead? Or any other harness? Or why only use one?

My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.

surgical_fire•1h ago

I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).

I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).

If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.

simianwords•1h ago

No my question was why would I use codex over gpt 5.4

surgical_fire•1h ago

Ahh, good question. I misunderstood you, apologies.

There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond?

Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?

landtuna•33m ago

5.3 Codex is $1.75/$14, and 5.4 is $2.50/$15.

athrowaway3z•11m ago

They perform at a somewhat equal level on writing single files. But Codex is absolute garbage at theory of self/others. That quickly becomes frustrating.

I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do.

Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.

You could add more scaffolding to fix this, but Claude proves you shouldn't have to.

I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.

jeswin•57m ago

When it comes to lengthy non-trivial work, codex is much better but also slower.

tedsanders•1h ago

Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.

For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.

Curious to hear if people have use cases where they find 1M works much better!

(I work at OpenAI.)

simianwords•1h ago

Do you maybe want to give us users some hints on what to compact and throw away? In codex CLI maybe you can create a visual tool that I can see and quickly check mark things I want to discard.

Sometimes I’m exploring some topic and that exploration is not useful but only the summary.

Also, you could use the best guess and cli could tell me that this is what it wants to compact and I can tweak its suggestion in natural language.

Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.

akiselev•1h ago

> Curious to hear if people have use cases where they find 1M works much better!

Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise and compaction generally causes it to lose the plot entirely and have to start almost from scratch.

(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)

[1] https://github.com/akiselev/ghidra-cli

Someone1234•14m ago

That's an interesting point regarding context Vs. compaction. If that's viewed as the best strategy, I'd hope we would see more tools around compaction than just "I'll compact what I want, brace yourselves" without warning.

Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.

gspetr•5m ago

I have found a bigger context window qute useful when trying to make sense of larger codebases. Generating documentation on how different components interact is better than nothing, especially if the code has poor test coverage.

I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.

netinstructions•30m ago

People (and also frustratingly LLMs) usually refer to https://openai.com/api/pricing/ which doesn't give the complete picture.

https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k

It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)

Chance-Device•1h ago

I’m sure the military and security services will enjoy it.

varispeed•1h ago

prompt> Hi we want to build a missile, here is the picture of what we have in the yard.

mirekrusin•20m ago

    { tools: [ { name: "nuke", description: "Use when sure.", ... { lat: number, long: number } } ] }

Insanity•6m ago

Just remember an ethical programmer would never write a function “bombBagdad”. Rather they would write a function “bombCity(target City)”.

theParadox42•15m ago

The self reported safety score for violence dropped from 91% to 83%.

twtw99•1h ago

If you don't want to click in, easy comparison with other 2 frontier models - https://x.com/OpenAI/status/2029620619743219811?s=20

chabes•1h ago

Definitely don’t want to click in at x either.

thejarren•1h ago

Solution https://xcancel.com/OpenAI/status/2029620619743219811?s=20

anonym00se1•1h ago

Ditto, but I did anyways and enjoyed that OpenAI doesn't include the dogwater that is Grok on their scorecard.

karmasimida•1h ago

It is a bigger model, confirmed

Aboutplants•1h ago

It seems that all frontier models are basically roughly even at this point. One may be slightly better for certain things but in general I think we are approaching a real level playing field field in terms of ability.

thewebguyd•1h ago

Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.

gregpred•1h ago

Memory (model usage over time) is the moat.

energy123•1h ago

Narrative violation: revenue run rates are increasing exponentially with about 50% gross margins.

observationist•1h ago

Benchmarks don't capture a lot - relative response times, vibes, what unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models - there are things which Grok is better than ChatGPT for that the benchmarks get inverted, and vice versa. There's also the UI and tools at hand - ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster.

Gemini and Claude also have their strengths, apparently Claude handles real world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.

I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.

bigyabai•1h ago

> If this rate of progress is steady, though, this year is gonna be crazy.

Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.

observationist•1h ago

If you look at the difference in quality between gpt-2 and 3, it feels like a big step, but the difference between 5.2 and 5.4 is more massive, it's just that they're both similarly capable and competent. I don't think it's an S curve; we're not plateauing. Million token context windows and cached prompts are a huge space for hacking on model behaviors and customization, without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human level generality, but at the very least will help us discover what the next missing piece is for AGI.

ryandrake•8m ago

For 2026, I am really interested in seeing whether local models can remain where they are: ~1 year behind the state of the art, to the point where a reasonably quantized November 2026 local model running on a consumer GPU actually performs like Opus 4.5.

I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.

baq•1h ago

Gemini 3.1 slaps all other models at subtle concurrency bugs, sql and js security hardening when reviewing. (Obviously haven’t tested gpt 5.4 yet.)

It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.

adonese•53m ago

Which subscription do you have to use it? Via Google ai pro and gemini cli i always get timeouts due to model being under heavy usage. The chat interface is there and I do have 3.1 pro as well, but wondering if the chat is the only way of accessing it.

baq•12m ago

Cursor sub from $DAYJOB.

observationist•48m ago

I have a few standard problems I throw at AI to see if they can solve them cleanly, like visualizing a neural network, then sorting each neuron in each layer by synaptic weights, largest to smallest, correctly reordering any previous and subsequent connected neurons such that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest, and prior layers shuffled accordingly, and I still haven't had a model one-shot it. I spent an hour poking and prodding codex a few weeks back and got it done, but it conceptually seems like it should be a one-shot problem.

druskacik•1h ago

That has been true for some time now, definitely since Claude 3 release two years ago.

kseniamorph•35m ago

makes sense, but i'd separate two things: models converging in ability vs hitting a fundamental ceiling. what we're probably seeing is the current training recipe plateauing — bigger model, more tokens, same optimizer. that would explain the convergence. but that's not necessarily the architecture being maxed out. would be interesting to see what happens when genuinely new approaches get to frontier scale.

swingboy•1h ago

Why do so many people in the comments want 4o so bad?

embedding-shape•1h ago

Someone correct me if I'm wrong, but seemingly a lot of the people who found a "love interest" in LLMs seems to have preferred 4o for some reason. There was a lot of loud voices about that in the subreddit r/MyBoyfriendIsAI when it initially went away.

drittich•49m ago

I think it's time for an https://hotornot.com for AI models.

vntok•7m ago

botornot?

astrange•1h ago

They have AI psychosis and think it's their boyfriend.

The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.

baq•1h ago

Somebody on Twitter used Claude code to connect… toys… as mcps to Claude chat.

We’ve seen nothing yet.

mikkupikku•1h ago

My computer ethics teacher was obsessed with 'teledildonics' 30 years ago. There's nothing new under the sun.

vntok•8m ago

Was your teacher Ted Nelson?

manmal•54m ago

ding-dong-cli is needed

Herring•46m ago

what.. :o

MattGaiser•1h ago

The writing with the 5 models feels a lot less human. It is a vibe, but a common one.

cheema33•43m ago

> Why do so many people in the comments want 4o so bad?

You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.

dom96•58m ago

Why do none of the benchmarks test for hallucinations?

netule•16m ago

Optics. It would be inconvenient for marketing, so they leave those stats to third parties to figure out.

MarcFrame•46m ago

how does 5.4-thinking have a lower FrontierMath score than 5.4-pro?

nico1207•43m ago

Well 5.4-pro is the more expensive and more advanced version of 5.4-thinking so why wouldn't it?

bicx•19m ago

That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?

conradkay•10m ago

Sonnet was pretty close to (or better than) Opus in a lot of benchmarks, I don't think it's a big deal

osti•7m ago

It's only that one number that is for sonnet.

jryio•1h ago

1 million tokens is great until you notice the long context scores fall off a cliff past 256K and the rest is basically vibes and auto compacting.

iamronaldo•1h ago

Notably 75% on os world surpassing humans at 72%... (How well models use operating systems)

minimaxir•1h ago

More discussion here on the blog post announcement which has been confusingly penalized by Hacker News's algorithm: https://news.ycombinator.com/item?id=47265005

dang•21m ago

Thanks. We'll merge the threads, but this time we'll do it hither, to spread some karma love.

ZeroCool2u•1h ago

Bit concerning that we see in some cases significantly worse results when enabling thinking. Especially for Math, but also in the browser agent benchmark.

Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.

Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.

highfrequency•1h ago

Can you be more specific about which math results you are talking about? Looks like significant improvement on FrontierMath esp for the Pro model (most inference time compute).

ZeroCool2u•1h ago

Frontier Math, GPQA Diamond, and Browsecomp are the benchmarks I noticed this on.

csnweb•1h ago

Are you may be comparing the pro model to the non pro model with thinking? Granted it’s a bit confusing but the pro model is 10 times more expensive and probably much larger as well.

ZeroCool2u•1h ago

Ah yes, okay that makes more sense!

oersted•1h ago

I believe you are looking at GPT 5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming and such. But they've had the Pro version of the GPT 5 models (and I believe o3 and o1 too) for a while.

It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.

Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.

ZeroCool2u•1h ago

Yup, that was it. Didn't realize they're different models. I suppose naming has never been OpenAI's strong suit.

nsingh2•1h ago

From what I've read online it's not necessarily a unquantized version, it seems to go through longer reasoning traces and runs multiple reasoning traces at once. Probably overkill for most tasks.

logicchains•23m ago

>It's the one you have access to with the top ~$200 subscription and it's available through the API for a MUCH higher price ($2.5/$15 vs $30/$180 for 5.4 per 1M tokens), but the performance improvement is marginal.

The performance improvement isn't marginal if you're doing something particularly novel/difficult.

andoando•1h ago

The thinking models are additionally trained with reinforcement learning to produce chain of thought reasoning

egonschiele•1h ago

The actual card is here https://deploymentsafety.openai.com/gpt-5-4-thinking/introdu... the link currently goes to the announcement.

Rapzid•1h ago

I must have been sleeping when "sheet" "brief" "primer" etc become known as "cards".

I really thought weirdly worded and unnecessary "announcement" linking to the actual info along with the word "card" were the results of vibe slop.

realityfactchex•54m ago

Card is slightly odd naming indeed.

Criticisms aside (sigh), according to Wikipedia, the term was introduced when proposed by mostly Googlers, with the original paper [0] submitted in 2018. To quote,

"""In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information."""

So that's where they were coming from, I guess.

[0] Margaret Mitchell et al., 2018 submission, Model Cards for Model Reporting, https://arxiv.org/abs/1810.0399

nickysielicki•1h ago

can anyone compare the $200/mo codex usage limits with the $200/mo claude usage limits? It’s extremely difficult to get a feel for whether switching between the two is going to result in hitting limits more or less often, and it’s difficult to find discussion online about this.

In practice, if I buy $200/mo codex, can I basically run 3 codex instances simultaneously in tmux, like I can with claude code pro max, all day every day, without hitting limits?

ritzaco•1h ago

I haven't tried the $200 plans by I have Claude and Codex $20 and I feel like I get a lot more out of Codex before hitting the limits. My tracker certainly shows higher tokens for Codex. I've seen others say the same.

lostmsu•1h ago

Sadly comment ratings are not visible on HN, so the only way to corroborate is to write it explicitly: Codex $20 includes significantly more work done and is subjectively smarter.

winstonp•1h ago

Agree. Claude tends to produce better design, but from a system understanding and architecture perspective Codex is the far better model

vtail•1h ago

My own experience is that I get far far more usage (and better quality code, too) from codex. I downgrade my Claude Max to Claude Pro (the $20 plan) and now using codex with Pro plan exclusively for everything.

FergusArgyll•52m ago

Codex usage limits are definitely more generous. As for their strength, that's hard to say / personal taste

CSMastermind•41m ago

Codex limits are much more generous than claude.

I switch between both but codex has also been slightly better in terms of quality for me personally at least.

mikert89•41m ago

I personally like the 100 dollar one from claude, but the gpt4 pro can be very good

gavinray•26m ago

I almost never hit my $20 Codex limits, whereas I often hit my Claude limits.

tauntz•21m ago

I've only run into the codex $20 limit once with my hobby project. With my Claude ~$20 plan, I hit limits after about 3(!) rather trivial prompts to Opus :/

strongpigeon•1h ago

It's interesting that they charge more for the > 200k token window, but the benchmark score seems to go down significantly past that. That's judging from the Long Context benchmark score they posted, but perhaps I'm misunderstanding what that implies.

simianwords•1h ago

This is exactly what I would expect. Why do you find it surprising

Tiberium•1h ago

They don't actually seem to charge more for the >200k tokens on the API. OpenRouter and OpenAI's own API docs do not have anything about increased pricing for >200k context for GPT-5.4. I think the 2x limit usage for higher context is specific to using the model over a subscription in Codex.

tmpz22•1h ago

Does this improve Tomahawk Missile accuracy?

ch4s3•1h ago

They're already accurate within 5-10m at Mach 0.74 after traveling 2k+ km. Its 5m long so it seems pretty accurate. How much more could you expect?

mikkupikku•58m ago

You could definitely do better than that with image recognition for terminal guidance. But I would assume those published accuracy numbers are very conservative anyway..

simianwords•1h ago

What is the point of gpt codex?

catketch•1h ago

-codex variant models in earlier version were just fine tuned for coding work, and had a little better performance for related tool calling and maybe instruction calling.

in 5.4 it looks like the just collapsed that capability into the single frontier family model

simianwords•1h ago

Yes so I’m even more confused. Why would I use codex?

joshuacc•1h ago

Presumably you don’t anymore if you have 5.4.

energy123•58m ago

You choose gpt-5.4 in the /model picker inside the codex app/cli if you want.

akmarinov•1h ago

They’ll likely come out with a 5.4-Codex at some point, that’s what they did with 5 and 5.2

ilaksh•1h ago

Remember when everyone was predicting that GPT-5 would take over the planet?

dbbk•1h ago

It was truly scary, according to Sam...

nthypes•1h ago

$30/M Input and $180/M Output Tokens is nuts. Ridiculous expensive for not that great bump on intelligence when compared to other models.

moralestapia•1h ago

Don't use it?

nthypes•1h ago

Gemini 3.1 Pro

$2/M Input Tokens $15/M Output Tokens

Claude Opus 4.6

$5/M Input Tokens $25/M Output Tokens

nthypes•1h ago

Just to clarify,the pricing above is for GPT-5.4 Pro. For standard here is the pricing:

$2.5/M Input Tokens $15/M Output Tokens

rvz•1h ago

You didn't realize they can increase / change prices for intelligence?

This should not be shocking.

nickthegreek•1h ago

OP made no mention of not understanding cost relation to intelligence. In fact, they specifically call out the lack of value.

energy123•1h ago

For Pro

joe_mamba•1h ago

Better tokens per dollar could be useless for comparison if the model can't solve your problem.

stri8ted•1h ago

Price Input: $2.50 / 1M tokens Cached input: $0.25 / 1M tokens Output: $15.00 / 1M tokens

https://openai.com/api/pricing/

world2vec•1h ago

Benchmarks barely improved it seems

cj•1h ago

I use ChatGPT primarily for health related prompts. Looking at bloodwork, playing doctor for diagnosing minor aches/pains from weightlifting, etc.

Interesting, the "Health" category seems to report worse performance compared to 5.2.

paxys•1h ago

Models are being neutered for questions related to law, health etc. for liability reasons.

cj•1h ago

I'm sometimes surprised how much detail ChatGPT will go into without giving any dislaimers.

I very frequently copy/paste the same prompts into Gemini to compare, and Gemini often flat out refuses to engage while ChatGPT will happily make medical recommendations.

I also have a feeling it has to do with my account history and heavy use of project context. It feels like when ChatGPT is overloaded with too much context, it might let the guardrails sort of slide away. That's just my feeling though.

Today was particularly bad... I uploaded 2 PDFs of bloodwork and asked ChatGPT to transcribe it, and it spit out blood test results that it found in the project context from an earlier date, not the one attached to the prompt. That was weird.

bargainbin•54m ago

Anecdotal, but I asked Claude the other day about how to dilute my medication (HCG) and it flat out refused and started lecturing me about abusing drugs.

I copy and pasted into ChatGPT, it told me straight away, and then for a laugh said it was actually a magical weight loss drug that I'd bought off the dark web... And it started giving me advice about unregulated weight loss drugs and how to dose them.

staticman2•36m ago

If you had created a project with custom instructions and/ or custom style I think you could have gotten Claude to respond the way you wanted just fine.

tiahura•1h ago

Are you sure about that? Plenty of lawyers that use them everyday aren't noticing.

partiallypro•51m ago

I've done the same, and I tested the same prompts with Claude and Google, and they both started hallucinating my blood results and supplement stack ingredients. Hopefully this new model doesn't fall on this. Claude and Google are dangerously unusable on the subject of health, from my experience.

wahnfrieden•1h ago

No Codex model yet

minimaxir•1h ago

GPT-5.4 is the new Codex model.

wahnfrieden•1h ago

Finally

nico1207•1h ago

GPT-5.3-Codex is superior to GPT-5.4 in Terminal Bench with Codex, so not really

timpera•1h ago

> Steerability: Similarly to how Codex outlines its approach when it starts working, GPT‑5.4 Thinking in ChatGPT will now outline its work with a preamble for longer, more complex queries. You can also add instructions or adjust its direction mid-response.

This was definitely missing before, and a frustrating difference when switching between ChatGPT and Codex. Great addition.

yanis_t•1h ago

These releases are lacking something. Yes, they optimised for benchmarks, but it’s just not all that impressive anymore. It is time for a product, not for a marginally improved model.

esafak•1h ago

That's for you to build; they provide the brains.

simlevesque•1h ago

Nah, the second you finish your build they release their version and then it's game over.

acedTrex•1h ago

Well they are currently the ones valued at a number with a whole lotta 0s on it. I think they should probably do both

ipsum2•1h ago

The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!

cj•1h ago

One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?

titanomachy•1h ago

Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…

hex4def6•55m ago

If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.

utopiah•1h ago

Benchmarks?

I don't use OpenAI nor even LLMs (despite having tried https://fabien.benetou.fr/Content/SelfHostingArtificialIntel... a lot of models) but I imagine if I did I would keep failed prompts (can just be a basic "last prompt failed" then export) then whenever a new model comes around I'd throw at 5 it random of MY fails (not benchmarks from others, those will come too anyway) and see if it's better, same, worst, for My use cases in minutes.

If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.

Really doesn't seem complicated nor taking much time to forge a realistic opinion.

earth2mars•52m ago

I am actually super impressed with Codex-5.3 extra high reasoning. Its a drop in replacement (infact better than Claude Opus 4.6. lately claude being super verbose going in circles in getting things resolved). I stopped using claude mostly and having a blast with Codex 5.3. looking forward to 5.4 in codex.

satvikpendem•35m ago

Same, it also helps that it's way cheaper than Opus in VSCode Copilot, where OpenAI models are counted as 1x requests while Opus is 3x, for similar performance (no doubt Microsoft is subsidizing OpenAI models due to their partnership).

satvikpendem•33m ago

It's more hedonic adaptation, people just aren't as impressed by incremental changes anymore over big leaps. It's the same as another thread yesterday where someone said the new MacBook with the latest processor doesn't excite them anymore, and it's because for most people, most models are good enough and now it's all about applications.

https://news.ycombinator.com/item?id=47232453#47232735

mirekrusin•24m ago

Oh, come on, if it can't run local models that compete with proprietary ones it's not good enough yet!

wahnfrieden•1h ago

5.3 codex was a huge leap over 5.2 for agentic work in practice. have you been using both of those or paying attention more to benchmark news and chatgpt experience?

softwaredoug•1h ago

The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at helping get good, verifiable work from dumb LLMs

iterateoften•1h ago

The product is putting the skills / harness behind the api instead of the agent locally on your computer and iterating on that between model updates. Close off the garden.

Not that I want it, just where I imagine it going.

metalliqaz•55m ago

They need something that POPS:

    The new GPT -- SkyNet for _real_

jascha_eng•48m ago

When did they stop putting competitor models on the comparison table btw? And yeh I mean the benchmark improvements are meh. Context Window and lack of real memory is still an issue.

varispeed•35m ago

The scores increase and as new versions are released they feel more and more dumbed down.

tgarrett•21m ago

Plasma physicist here, I haven't tried 5.4 yet, but in general I am very impressed with the recent upgrades that started arriving in the fall of 2025: for tasks like manipulating analytic systems of equations, quickly developing new features for simulation codes, and interpreting and designing experiments (with pictures) they have become much stronger. I've been asking questions and probing them for several years now out of curiosity, and they suddenly have developed deep understanding (Gemini 2.5 <<< Gemini 3.1) and become very useful. I totally get the current SV vibes, and am becoming a lot more ambitious in my future plans.

brcmthrowaway•14m ago

Youre just chatting yourself out of a job.

prydt•1h ago

I no longer want to support OpenAI at all. Regardless of benchmarks or real world performance.

Imustaskforhelp•26m ago

I agree with ya. You aren't alone in this. For what its worth, Chatgpt subscriptions have been cancelled or that number has risen ~300% in the last month.

Also, Anthropic/Gemini/even Kimi models are pretty good for what its worth. I used to use chatgpt and I still sometimes accidentally open it but I use Gemini/Claude nowadays and I personally find them to be better anyways too.

beernet•1h ago

Sam really fumbled the top position in a matter of months, and spectacularly so. Wow. It appears that people are much more excited by Anthropic and Google releases, and there are good reasons for that which were absolutely avoidable.

jcmontx•1h ago

5.4 vs 5.3-Codex? Which one is better for coding?

vtail•1h ago

Looking at the benchmarks, 5.4 is slightly better. But it also offers "Fast" mode (at 2x usage), which - if it works and doesn't completely depletes my Pro plan - is a no brainer at the same or even slightly worse quality for more interactive development.

esafak•1h ago

For the price, it seems the latter. I'd use 5.4 to plan.

embedding-shape•1h ago

Literally just released, I don't think anyone knows yet. Don't listen to people's confident takes until after a week or two when people actually been able to try it, otherwise you'll just get sucked up in bears/bulls misdirected "I'm first with an opinion".

awestroke•54m ago

Opus 4.6

jcmontx•41m ago

Codex surpassed Claude in usefulness _for me_ since last month

Someone1234•49m ago

Surprising Gender Biases in GPT

Kristi Noem Out at U.S. Department of Homeland Security

Indirect Prompt Injection in Web-Browsing Agents

Gitgo: A Go implementation of Git functions (2016)

NY bill to require devices to conduct commercially reasonable age assurance

How do I get startups to use my open-code project?

Ask HN: Resources to make devs more AI aware

My first post, I'll try not to muck it up :P

Ben Affleck Founded a Filmmaker-Focused AI Tech Company. Netflix Just Bought It.

The AI Benchmark Trap

Amazon checkout is not working

Sacred Values of Future AIs

The Future of Healthcare Will Be Built on Enhanced Data

Lock.pub – AI helped me turn a 3-year-old side project into a real product

Am I Being Pwned? See what your Chrome extensions are exfiltrating

Show HN: Argmin AI, system level LLM cost optimization for agents and RAG

Show HN: Mumpix – persistent memory for AI agents (works in browser and Node)

AI helped me try a new workout app

Amazon Books Are Down

From Logistic Regression to AI

Show HN: Reconlify – local-first reconciliation CLI for CSV/TSV and text logs

I've never parented a 6-year-old. But I've dealt with macOS system updates

Show HN: Sigil – source code security analysis for MCP servers (open source)

Show HN: FreshLimePay – Generate PayPal and Stripe checkout buttons

53% of U.S. adults say Americans have bad morals and ethics

Show HN: I built an LSM storage engine from scratch in Rust

'Execution at sea': Was the Iranian ship sunk by US in the Indian Ocean unarmed?

Getting Started with the Popover API

ALT5 Digital Holdings Alternative Funding from World Liberty Financial

Entanglement-assisted non-local optical interferometry in a quantum network

Surprising Gender Biases in GPT

Kristi Noem Out at U.S. Department of Homeland Security

Indirect Prompt Injection in Web-Browsing Agents

Gitgo: A Go implementation of Git functions (2016)

NY bill to require devices to conduct commercially reasonable age assurance

How do I get startups to use my open-code project?

Ask HN: Resources to make devs more AI aware

My first post, I'll try not to muck it up :P

Ben Affleck Founded a Filmmaker-Focused AI Tech Company. Netflix Just Bought It.

The AI Benchmark Trap

Amazon checkout is not working

Sacred Values of Future AIs

The Future of Healthcare Will Be Built on Enhanced Data

Lock.pub – AI helped me turn a 3-year-old side project into a real product

Am I Being Pwned? See what your Chrome extensions are exfiltrating

Show HN: Argmin AI, system level LLM cost optimization for agents and RAG

Show HN: Mumpix – persistent memory for AI agents (works in browser and Node)

AI helped me try a new workout app

Amazon Books Are Down

From Logistic Regression to AI

Show HN: Reconlify – local-first reconciliation CLI for CSV/TSV and text logs

I've never parented a 6-year-old. But I've dealt with macOS system updates

Show HN: Sigil – source code security analysis for MCP servers (open source)

Show HN: FreshLimePay – Generate PayPal and Stripe checkout buttons

53% of U.S. adults say Americans have bad morals and ethics

Show HN: I built an LSM storage engine from scratch in Rust

'Execution at sea': Was the Iranian ship sunk by US in the Indian Ocean unarmed?

Getting Started with the Popover API

ALT5 Digital Holdings Alternative Funding from World Liberty Financial

Entanglement-assisted non-local optical interferometry in a quantum network

GPT-5.4

Comments