Why current LLM costs are not sustainable

https://aditya.patadia.org/p/ai-and-cloud-costs

71•adityapatadia•1h ago

Comments

chiply314•56m ago

I think companies will fire 5-10% of people and convert their salaries into token budget.

I also believe that before any real companies are running these models locally, they will already have some kind of agentic layer.

With the current progress of the frontier model labs, I do not see any real company that makes real money running local models.

Running local models is easy for me, but for sure not that easy for a company. Your DC needs to be able to host GPUs, it needs the cooling capacity, and you need to have a DC in the first place. Without a DC, you need to have someone maintaining critical infrastructure, taking care of model evaluation, etc.

For external parties, there might be a new business model: you might not hire an external contractor anymore, but a token budget and the 'operator of the token budget'.

The current chip fabs are full; developing a high-end / cheapish local LLM chip will still take a few years as long as the DC GPU demand stays as high as it is.

lukebuehler•50m ago

I work with large enterprises that _only_ run critical workloads on locally hosted models. Think banks, insurance, etc--businesses that absolutely cannot leak any data. They also have CC and Codex, but their use is extremely restricted; anything of consequence runs on models running on GPU clusters in their own datacenter.

chiply314•22m ago

I work at a large enterprise and they are happy paying Microsoft and AWS for model hosting.

For sure there will be use cases with very critical data, but in the end the question will still be how big they are in comparison to the rest of the market.

These critical workloads also have the cost issue, right? So will they reduce workforce to compensate for the budget?

netdevphoenix•42m ago

I am calling it now. LLM hosting is the new web hosting. You will have a market of hosting providers offering you access to LLM-compatible hardware (the Hetzners of the LLM world) as well as virtualised LLM access (the Herokus of the LLM world). These will compete along pricing and ownership axes, while frontier labs will compete mostly on performance, integration and ease of use (think WordPress).

That's the only way I can see frontier labs charging high enough to sustain the cash flow needed to operate as racing to the bottom is not possible for them.

It is interesting to think whether this is another "Cambrian" era like that of the smartphone OSes, when you had Symbian, Android, iOS, Windows Mobile and so many others competing.

chiply314•20m ago

I work at a very big company and they just pay Azure and AWS to host Claude and co. for them.

So the hyperscalers have probably already won for now.

At the end of the day, you send a lot of personal data to these endpoints. If you already host everything through Microsoft, LLM hosting is then a no-brainer.

walrus01•55m ago

I have already seen a number of people doing the math on what hardware it would take to self-host a Q8XL quantization of GLM5.2 shared between N people.

There are additional advantages: everything you query, all of your context cache and everything it outputs stays private, and it can't be arbitrarily turned off by external interference.

Personally I think it would be a fairly good bet that something with the 1TB of RAM needed to properly self-host GLM5.2 will still be a very usable piece of hardware in 4 to 5 years from now. There will be even larger, newer models available, sure. But there will also be better models that continue to fit in the same size.

dofm•19m ago

Back in the earlier days of the internet, when "dedicated servers" were a competitive advantage, hobbyists and small dev shops definitely shared dedicated hardware.

So you could see small LLM co-operatives working out, yeah.

But my thinking is that this four-to-five-year scenario just won't come to fruition, because the whole concept of needing to run these massive, massive models is slightly more likely than not to be rendered moot by smaller models with better reasoning capacity, and possibly even in that timescale by hardware innovations.

One of the biggest problems I have with the whole "we won't be profitable until 2030" model is that 2030 is almost exactly as far into the future as the launch of ChatGPT is in the past, and in that time, models far more capable than that first ChatGPT have been made available to freely download and run on desktop hardware that existed before it launched, and the entire non-model surrounding functionality of that original ChatGPT plus many more functions is now not much more than a routine weekend coding project.

I don't know why the market would entertain the idea that no upset like that is possible in the same period of time again.

xienze•3m ago

> So you could see small LLM co-operatives working out, yeah.

Only on a pay-per-token basis, I think. Unless it's a very tight-knit circle of folks. Fixed monthly subscription costs I doubt would work in that model. Because you'll get the inevitable: someone pegging the service 24/7 because it's "unlimited" while everyone else suffers.

albertgoeswoof•52m ago

One thing missing here is the maturity of agent harnesses. I’m finding the free DeepSeek Flash model in OpenCode can handle all of my simple tasks, because the harness is so good. Soon that will be a local model.

And the reality is that other industries aren’t finding as much use for LLMs as programmers are. Sure, there are some benefits, but you can’t fire your marketing department and replace it with AI.

bflesch•19m ago

AI is Google-in-a-box, and there will be dedicated hardware to run it locally like there was with the crypto ASICs.

I feel the only ones losing are the AI startups and Google. This is why they're trying to morph into a social-media-like experience of simulated human interaction that can monetize a certain demographic of vulnerable people.

KronisLV•49m ago

> To give an example, just doing Typescript type fixes with this model across 50 files cost me $54 this afternoon.

If you can use a subscription with any of the SOTA models, do that.

Instead of around 4k EUR in token costs, my Opus usage costs me 108 EUR (with taxes) per month with their Max 5x plan. It's the same with OpenAI, those are heavily subsidized.

It doesn't make sense to pay per-token, unless you must.
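As a rough illustration of the gap described above, here is a minimal sketch of the arithmetic. The two figures are the ones quoted in this comment (around 4k EUR in per-token costs versus 108 EUR for the subscription); the function name is made up for the example and nothing here reflects any provider's actual pricing model.

```python
def subscription_advantage(token_cost_eur: float, subscription_eur: float) -> tuple[float, float]:
    """Return (absolute saving, cost ratio) of per-token billing vs a flat plan."""
    saving = token_cost_eur - subscription_eur
    ratio = token_cost_eur / subscription_eur
    return saving, ratio


# Figures taken from the comment above; both are approximate.
saving, ratio = subscription_advantage(4000.0, 108.0)
print(f"saving: {saving:.0f} EUR, per-token is about {ratio:.0f}x the subscription price")
```

Under these numbers the same usage costs roughly 37 times more when billed per token, which is the whole argument for using a subscription where you can.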

> What is happening here is that leading AI labs are charging not only for inference but also for research in model architecture, training data collection and curation, model training cost (which can be tens or even hundreds of millions of dollars), paying their employees and recovering the marketing costs.

Chances are, they're never getting that money back. Best case scenario, the hype around AI slowly declines, worst case - it crashes and takes a part of the economy with it.

Also anyone doing distillation with hundreds or thousands of those subsidized attacks is probably winning big. Especially as the model architectures (e.g. DeepSeek V4) are more oriented towards efficiency.

> Last but not least and in fact the most important factor, is the ability of users to run local models. So far, almost everyone is using cloud-hosted models and local models are either too big to deploy or too slow to work with. With advancements in chips, this will change in 4-5 years’ time.

Currently, beefy hardware that runs them fast enough to be competitive with the cloud (at least 60 tps) is expensive, and even then the small local models rather suck compared to SOTA or even DeepSeek V4 Pro and GLM 5.2, though they're way better than they used to be (compare Qwen 3.6 with 2.5, for example).

ReptileMan•43m ago

Why do you think that subscriptions are subsidized and not that enterprise tokens are sold at 3000% margin? There are few enough frontier labs that a cartel is possible.

_flux•37m ago

I think this comes from the idea that serving these tokens, even without paying for training, is already expensive, e.g. https://news.ycombinator.com/item?id=46613887: a self-hosted solution might give you only a 10-100x more affordable solution at cost.

So, given that the SOTA providers with even larger models also need to continuously use considerable resources for training their next models, fund future data centers, and make a profit, the token prices are more likely to reflect the real costs than the subscription prices are.

dude250711•48m ago

Not the end of the world!

OpenAI and Anthropic will just go back to entirely healthy valuations of ~$5-10B each and the industry carries on.

arjunchint•46m ago

There is a wave of users switching over to DeepSeek Flash. There are Reddit threads of users sharing billion-token spend for $20.

If all of the global spend on Anthropic/OpenAI/Gemini APIs just switched over to DeepSeek, we could easily decrease total AI spend by 10x.

juleiie•44m ago

I am not sure if that is wise. It’s a hostile superpower after all

ReptileMan•42m ago

Well... Open weights run on-premises are politically neutral.

akie•41m ago

Try doing it at scale for a whole office. Not trivial.

ReptileMan•38m ago

You could probably do with a couple of instances. People rarely use AI 24/7, so right now you can oversubscribe and still have acceptable latency and a high utilization rate.

arjunchint•37m ago

There are plenty of US-based hosters racing to optimize and drive efficiencies.

There is a literal race on Twitter, with people posting about increasing token throughput and driving down costs on these Chinese open source models.

akie•43m ago

I am convinced that the combination of capable open weight models and specialized hardware will mean that Apple (and other hardware providers) will start shipping computers with built-in, hardwired, "LLM-on-a-chip" cards that are capable enough to meet 90% of your AI needs.

I really believe that in the near-term future we will run our LLMs in hardware, not in software. Hardwire a capable model into a device the size of a graphics card, embed it into a laptop, and you have something that uses less power, does faster inference, doesn't require additional CPU or memory, doesn't cost a monthly fee, and will probably eventually be available for under a (few) hundred bucks.

karussell•42m ago

The current costs do not have to be sustainable for the SOTA model providers as they grow their user base. But I really wonder about the future as the costs have to increase at some point (to be sustainable) but at the same time the competition and local models get better and better.

exizt88•38m ago

> We are seeing improvements with each model release these days but it’s clear that the improvements are getting smaller and smaller.

This is obviously untrue, with both GPT-5.4 and Claude Fable as examples from the last 6 months.

rubin55•36m ago

I would struggle to ascertain the day-to-day difference between GPT-5.4 and GPT-5.5 tbh. Also, imho, Fable is highly hyped; I don't think it is dramatically better than Opus 4.8. Maybe my tasks and interactions with AI are relatively simple (i.e., lots of Rust programming, Linux system engineering stuff).

chiply314•12m ago

I haven't had enough time with Fable, but I had to look back at how I worked with Claude just 6 months ago to remind myself that it got a lot better.

For example, I still used plan mode 6 months ago; now I don't.

I would argue that with every model release we have a new learning phase.

byzantinegene•33m ago

GPT-5.5 regularly wastes tokens on wrong commands and requires lots of handholding. I highly doubt there's substantial improvement.

skerit•18m ago

> the improvements are getting smaller and smaller

The AI haters have been saying this for 2 years now.

jillesvangurp•38m ago

Current prices will come down. There is a lot of potential for optimization: energy efficiency, energy generation, self hosting, model size and specialization, etc. Right now the state of the art is powering data centers with gas-powered turbine generators. That's not very efficient.

bflesch•28m ago

Of course, but will the AI startups with their SaaS business model survive?

jeswin•38m ago

Would prefer not to offend the author, but I do believe this article has very little for the HN audience. No new insight, and no numbers or new information.

ludamad•8m ago

Is there any place with better curation? I notice quite a few articles summarizing the state of AI that feel redundant with one another

bflesch•33m ago

Spot on. From a US outsider's perspective there's so much ridiculous stuff going on that you feel like you're watching an episode of "bum fights". I don't think US knowledge workers alone can carry this bubble.

xmstan•32m ago

There is a good and cheap alternative: R9700 32GB + Qwen 3.5 27B. It won't give you SOTA performance, but will be as good as Sonnet a few months back.

arbayi•31m ago

i think we have the causation backwards here. llms aren't expensive because they have to be — they're expensive because we keep reaching for the expensive model instead of putting any effort into making the cheap one good enough.

a surprisingly large fraction of production workloads can be handled by smaller models with the right scaffolding. it's often easier to switch to a larger model than to engineer those pieces, so many teams never bother.

my intuition is that a lot of the current "ai cost crisis" is really an orchestration problem rather than a model pricing problem. before asking whether frontier pricing is sustainable, i'd first ask how much of that spend is simple tasks being sent to the smartest available model by default.

my bet for the next few years is that the model itself stops being where the value is. frontier models will become more like commodities, and the real difference will be the layer around them: routing each task to the cheapest model that can do it well, verifying the output, and only escalating when needed.

eventually, asking "which model do you use?" will sound a bit like asking "which cpu do you use?" the engine still matters, but the system built around it matters a lot more.
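The routing layer described above (try the cheapest model first, verify the output, escalate only when needed) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not anyone's actual orchestration stack: the `Tier` type, the stub model functions, the per-call costs and the `non_empty` verifier are all hypothetical stand-ins. In practice `run` would call a real model and `verify` would be a test suite, type check or similar.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Tier:
    name: str                    # hypothetical label for a model tier
    cost_per_call: float         # illustrative cost, not a real price
    run: Callable[[str], str]    # stand-in for an actual model call


def route(
    task: str,
    tiers: List[Tier],
    verify: Callable[[str, str], bool],
) -> Tuple[Optional[str], Optional[str], float]:
    """Try tiers from cheapest to most expensive; stop at the first verified output.

    Returns (tier name, output, total spend). Failed attempts still count
    towards the spend, which is the price of escalating.
    """
    spent = 0.0
    for tier in sorted(tiers, key=lambda t: t.cost_per_call):
        output = tier.run(task)
        spent += tier.cost_per_call
        if verify(task, output):
            return tier.name, output, spent
    return None, None, spent  # no tier produced an acceptable result


# Stub models for illustration only.
def small_model(task: str) -> str:
    # Pretend the cheap model only copes with tasks tagged "simple".
    return f"done: {task}" if task.startswith("simple") else ""


def large_model(task: str) -> str:
    return f"done: {task}"


def non_empty(task: str, output: str) -> bool:
    # Placeholder verifier; a real one would run tests or a type checker.
    return bool(output)


TIERS = [Tier("large", 1.00, large_model), Tier("small", 0.05, small_model)]

for task in ("simple: rename variable", "hard: redesign schema"):
    name, _, spent = route(task, TIERS, non_empty)
    print(f"{task!r} handled by {name}, spend {spent:.2f}")
```

The design choice worth noting is that a failed cheap attempt is not free: the escalated task costs the sum of both calls, so this only pays off when most tasks really are simple and the verifier is cheap and reliable.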

FinnLobsien•28m ago

The problem space has a few aspects:

1. We're still in the "$5 airport Uber" era of LLMs. They're heavily subsidized, and everyone still complains about costs.

2. There hasn't been a real incentive to work on cost optimization for data centers and the hardware they contain. When/if price hikes happen and send people scrambling to use other models or drastically reduce AI usage, this will suddenly need to happen.

3. We're massively overusing SOTA models. As long as you're on a subsidized subscription, you can use Claude Opus 4.8 high to write blog article meta descriptions. If you paid by token, you wouldn't do that.

4. Open models are a wildcard that could completely change the calculus.

eru•27m ago

Mostly agreed, however I'm not sure about 3: I suspect it works like gym memberships, and the companies mostly make their money from people who don't use the subscriptions all that much.

FinnLobsien•21m ago

I think the problem is that the companies mostly don't make money, period. They may have better unit economics on underused subscriptions, but I don't see a world in which OAI/Anthropic don't heavily tighten the screws in the future.

Right now it's silly to default to frontier models, but it won't bankrupt your company. I believe in the short-medium term future, we'll need to be more deliberate about model choices.

In the long-term, of course, tech costs tend to plummet. Is there a future where in 15 years, my Apple Watch locally runs an Opus 4.8-class model? Maybe. And that would obviate this whole discussion.

iamacyborg•16m ago

I follow a guy called Daniel McCarthy on LinkedIn who writes a lot on CLV and that seems to be his take. Even if theoretically you get way more than you pay with subscriptions, the vast majority of people are not power users.

https://danielminhmccarthy.com/

Someone•26m ago

Of course they do. How else do you expect them to pay for that? If you buy a Foo from Acme, Inc, you aren’t only paying construction costs, either.

> On the other hand, once an open weight model is released, any inference provider can easily host it and just do some markup on inference cost. This proves way cheaper than running a frontier AI lab.

The only logical conclusion for commercial AI labs is to never release their models as open data, and try to stay ahead of open models. One way to do that is by having better models, another by having more users (because that decreases the per-user costs of creating the models, decreasing the price difference with companies running open models). The frontier labs are aiming for a combination of both.

ramon156•22m ago

> and Microsoft, Salesforce and Github are taking steps to reduce AI spend by employees.

anyone got a source? sounds juicy

simianwords•22m ago

The author understands well that open source is catching up, but I think that the gap will remain constant - SOTA models will still be more performant.

The author mentions $54 in costs but the reality is that developers are paid around this much per hour.

What is likely to happen: LLM performance goes even higher and can do tasks that take humans days to accomplish. You then have to compare LLM cost with human cost - something the author has forgotten in their analysis.

xienze•7m ago

> The author mentions $54 in costs but the reality is that developers are paid around this much per hour.

Sure, but imagine a situation where you've spent an hour going back and forth with the LLM trying to fix a problem and at the end of it you've only made minimal progress. Now you've spent an hour of your time AND $54 with little to show for it. It's a metric I don't think many people track: the cost of going in circles with an LLM for an extended period of time while burning tokens and still not resolving the problem.
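The metric described above can be written down in a few lines. This is only a sketch: the $54 token spend and the roughly $54 hourly rate come from the thread, while the `progress` fraction (how much of the problem actually got solved) is an assumed, illustrative input that someone would have to estimate themselves.

```python
def session_cost(hours: float, hourly_rate: float, token_spend: float) -> float:
    """Total cost of an LLM-assisted session: the human's time plus the tokens."""
    return hours * hourly_rate + token_spend


def cost_per_solved_problem(total_cost: float, progress: float) -> float:
    """Scale the session cost by the fraction of the problem actually resolved.

    `progress` is a subjective estimate in (0, 1]; it is an assumption of
    this sketch, not something tracked by any tool mentioned in the thread.
    """
    if not 0 < progress <= 1:
        raise ValueError("progress must be in (0, 1]")
    return total_cost / progress


# One hour of developer time at $54/h plus $54 in tokens.
total = session_cost(hours=1, hourly_rate=54, token_spend=54)

# Illustrative guess: the session only got a quarter of the way there.
effective = cost_per_solved_problem(total, progress=0.25)

print(f"session: ${total:.0f}, effective cost per solved problem: ${effective:.0f}")
```

With these inputs the session costs $108, and at a quarter of the problem solved the effective cost of a full fix is four times that, which is the number that rarely gets tracked.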

simianwords•5m ago

That happens with humans too and for sure LLMs make it better not worse.

I know the number of times I tried to do something where the answer was simple but I took a few days to get there.

rvz•22m ago

This is no surprise at all and was very predictable.

The Chinese open weight models were always winning the AI race to zero, whereas the likes of Anthropic and OpenAI have no choice but to increase token costs.

Even Microsoft wants to use some of the Chinese models, only now realizing how expensive the frontier models are. It turns out that Jevons paradox does not exist in the US (it exists in China).

This "Tokenmaxxing" marketing stunt was a scam for the frontier models to raise even more money at unsustainable valuations.

_pdp_•22m ago

Prices will go down one way or another. That is of course unless the market gets cornered by restricting model use, restricting supply of essential hardware components or raw materials to make this hardware, etc.

In terms of running the model locally vs a service provider, that will be down to convenience more than anything else for the same reason why not everyone is hosting their own website at home on their own box.

chiply314•16m ago

Token prices will go down for sure, but I watched a video interview on YouTube with the Cloudflare CEO, and apparently internet traffic from agents has increased and overtaken human traffic.

If we continue this year with A2A, agentic layers and co., there is probably a huge bulk coming up, with a lot more agents running a lot longer and talking to each other to solve issues, which will increase token usage significantly.

swiftcoder•20m ago

> To give an example, just doing Typescript type fixes with this model across 50 files cost me $54 this afternoon.

Who in hell would actually do this? That's a level of problem that any of the flash-class models can solve.

Hand that sort of thing to GPT-mini, Haiku, or DeepSeek Flash, and save the big guns for big architectural problems.

raincole•13m ago

> To give an example, just doing Typescript type fixes with this model across 50 files cost me $54 this afternoon.

1. How much would it cost in terms of programmers' salaries?

2. Can DeepSeek do this (I bet it can) and how much would it cost?

The fact that the author even had the idea of using a SOTA model to do this means LLMs are actually quite cheap.

veselin•12m ago

The more I think on the problem, the more I believe this will be solved with US interventions. And the interventions will increase inflation by a lot, so prices will not go down.

The other alternatives with LLMs becoming more expensive in an Uber-like move may not work due to a lot of competition. I also don't think usage will increase 10x. I don't always have coding tasks for an LLM despite it being good.

My reasons to believe so are outside of what interests the HN community, and I am neither endorsing this behavior nor do I think it is that simple. But the US also has a huge debt that it must service. Wouldn't it be convenient if it was suddenly halved in actual value?

byzantinegene•7m ago

unlikely scenario as the main mandate of the federal reserve is to keep inflation in check. inflation reaching such levels would also cause interest rates to rise astronomically, and this would make the debt harder to service

ajdegol•10m ago

> doing Typescript type fixes with this model across 50 files cost me $54 this afternoon.

Not trying to be harsh, but that sounds like a skill issue. You have the language server to lean on; easy feedback loop; sub agent per type.

charcircuit•8m ago

>Most AI labs have likely ingested everything available in digital and print media for the model training.

This isn't how coding models get better though. Why would this have anything to do with plateauing?

yturijea•4m ago

I am using perhaps 15% of usage count on Claude with just the normal subscription. And I do full time software engineering and would say I use quite a lot of AI input on thoughts, designs and code drafts.

So how these companies and people manage to use these absurd amounts of tokens is a mystery to me. It feels like they are just feeding huge amounts of non-vetted data to the LLMs and/or running loops against the LLMs which produce only fractional, if not wasted, results at insane cost.

So really it is the equivalent of just burning money, or heating your house in the winter while having all your windows open.

starchild3001•3m ago

A few thoughts:

1. Chat, being 3 years old, is a fairly mature and solved problem today. Top companies aren't even talking about it anymore! Gemma 31B does it amazingly well (for $0.4 per 1M output tokens). Practically every near-SoTA and SoTA model does simple "chat-like" QA amazingly well -- summarization, basic question answering, single- or few-step search.

2. Tasks -- or knowledge work on a computer -- are the new frontier. Computers have become competent only recently, and only for some of the tasks so far. I'd guess another 2-3 year development cycle will happen, after which "el cheapo" models will be virtually indistinguishable from SoTA.

As tasks are the new game in town, AI labs can still charge a premium for them; for chat that premium has disappeared already. Most users cannot tell a 99% correct answer from a 95% correct answer, nor do they always wish for maximum accuracy.
