frontpage.

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
625•klaussilveira•12h ago•182 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
927•xnx•18h ago•547 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
33•helloplanets•4d ago•24 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
109•matheusalmeida•1d ago•27 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
10•kaonwarb•3d ago•7 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
40•videotopia•4d ago•1 comment

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
220•isitcontent•13h ago•25 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
210•dmpetrov•13h ago•103 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
322•vecti•15h ago•142 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
370•ostacke•18h ago•94 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
358•aktau•19h ago•181 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
478•todsacerdoti•20h ago•232 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
272•eljojo•15h ago•161 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
402•lstoll•19h ago•271 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
85•quibono•4d ago•20 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
14•jesperordrup•2h ago•7 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
25•romes•4d ago•3 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
56•kmm•5d ago•3 comments

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
3•theblazehen•2d ago•0 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
12•bikenaga•3d ago•2 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
244•i5heu•15h ago•189 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
52•gfortaine•10h ago•21 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
140•vmatsiiako•17h ago•63 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
280•surprisetalk•3d ago•37 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1058•cdrnsf•22h ago•433 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
133•SerCe•8h ago•117 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
70•phreda4•12h ago•14 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
28•gmays•8h ago•11 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
176•limoce•3d ago•96 comments

FORTH? Really!?

https://rescrv.net/w/2026/02/06/associative
63•rescrv•20h ago•22 comments

The End of Moore's Law for AI? Gemini Flash Offers a Warning

https://sutro.sh/blog/the-end-of-moore-s-law-for-ai-gemini-flash-offers-a-warning
113•sethkim•7mo ago

Comments

cmogni1•7mo ago
The article does a great job of highlighting the core disconnect in the LLM API economy: linear pricing for a service with non-linear, quadratic compute costs. The traffic analogy is an excellent framing.

One addition: the O(n^2) compute cost is most acute during the one-time prefill of the input prompt. I think the real bottleneck, however, is the KV cache during the decode phase.

For each new token generated, the model must access the intermediate state of all previous tokens. This state is held in the KV Cache, which grows linearly with sequence length and consumes an enormous amount of expensive GPU VRAM. The speed of generating a response is therefore more limited by memory bandwidth.

Viewed this way, Google's 2x price hike on input tokens is probably related to the KV Cache, which supports the article’s “workload shape” hypothesis. A long input prompt creates a huge memory footprint that must be held for the entire generation, even if the output is short.
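
For a sense of scale, here is a back-of-envelope sketch of that memory footprint, assuming a Llama-3-8B-style configuration with grouped-query attention (all numbers are illustrative assumptions, not any provider's actual internals):

    # KV cache sizing sketch; every figure below is an assumption for an
    # 8B-class model with grouped-query attention, not a real provider config.
    layers = 32          # transformer layers
    kv_heads = 8         # key/value heads (GQA)
    head_dim = 128       # dimension per head
    bytes_per_elem = 2   # fp16/bf16

    # One K vector and one V vector are cached per layer for every token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")   # ~128 KiB

    for context in (8_000, 128_000, 1_000_000):
        gib = bytes_per_token * context / 2**30
        print(f"{context:>9,} tokens -> {gib:6.1f} GiB held for the whole generation")

Even a short answer to a long prompt keeps that entire footprint resident until the last output token is generated.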

trhway•7mo ago
That obviously should and will be fixed architecturally.

>For each new token generated, the model must access the intermediate state of all previous tokens.

Not all the previous tokens are equal; not all deserve the same attention, so to speak. The farther the tokens, the more opportunity for many of them to be pruned and/or collapsed with other similarly distant and less meaningful tokens in a given context. So instead of O(n^2) it would be more like O(n log n).

I mean, you'd expect that, for example, "knowledge worker" models (vs. say "poetry" models) would possess some perturbative stability wrt. changes to/pruning of the remote previous tokens, at least for those tokens which are less meaningful in the current context.

Personally, I feel the situation is good - performance engineering work again becomes somewhat valuable as we're reaching N where O(n^2) forces management to throw some money at engineers instead of at the hardware :)
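
A toy way to see how pruning changes the asymptotics (purely illustrative, not how any production model actually works): keep a fixed local window plus exponentially spaced distant tokens, then count how many key/value pairs each new token attends to.

    import math

    def attended(i, window=64):
        # Toy pruning: token i attends to its last `window` tokens plus a handful
        # of tokens at exponentially spaced distances beyond the window.
        if i <= window:
            return i
        distant = int(math.log2(i / window))   # roughly log2(i/window) far tokens kept
        return window + distant

    for n in (1_000, 10_000, 100_000):
        dense = n * (n - 1) // 2                      # full attention: O(n^2) pairs
        pruned = sum(attended(i) for i in range(n))   # roughly O(n log n) pairs
        print(f"n={n:>7,}  dense={dense:>14,}  pruned={pruned:>12,}")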

simonw•7mo ago
"In a move that at first went unnoticed, Google significantly increased the price of its popular Gemini 2.5 Flash model"

It's not quite that simple. Gemini 2.5 Flash previously had two prices, depending on if you enabled "thinking" mode or not. The new 2.5 Flash has just a single price, which is a lot more if you were using the non-thinking mode and may be slightly less for thinking mode.

Another way to think about this is that they retired their Gemini 2.5 Flash non-thinking model entirely, and changed the price of their Gemini 2.5 Flash thinking model from $0.15/m input, $3.50/m output to $0.30/m input (more expensive) and $2.50/m output (less expensive).
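
To make the change concrete, here is a quick comparison under the two price lists quoted above (prices per million tokens; the workloads are made up for illustration):

    # Old "thinking" pricing vs. the unified 2.5 Flash pricing, per million tokens.
    OLD = {"in": 0.15, "out": 3.50}
    NEW = {"in": 0.30, "out": 2.50}

    def cost(p, input_tokens, output_tokens):
        return (input_tokens * p["in"] + output_tokens * p["out"]) / 1e6

    workloads = {
        "long prompt, short answer": (100_000, 500),
        "short prompt, long answer": (2_000, 8_000),
    }
    for name, (inp, out) in workloads.items():
        print(f"{name}: ${cost(OLD, inp, out):.4f} -> ${cost(NEW, inp, out):.4f}")

Input-heavy calls get roughly twice as expensive while output-heavy calls get a bit cheaper, which lines up with the "workload shape" reading elsewhere in this thread.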

Another minor nit-pick:

> For LLM providers, API calls cost them quadratically in throughput as sequence length increases. However, API providers price their services linearly, meaning that there is a fixed cost to the end consumer for every unit of input or output token they use.

That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly not Gemini 2.5 Flash) charges a higher rate for inputs over 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000 tokens. As a result I treat those as separate models on my pricing table on https://www.llm-prices.com

One last one:

> o3 is a completely different class of model. It is at the frontier of intelligence, whereas Flash is meant to be a workhorse. Consequently, there is more room for optimization that isn’t available in Flash’s case, such as more room for pruning, distillation, etc.

OpenAI are on the record that the o3 optimizations were not through model changes such as pruning or distillation. This is backed up by independent benchmarks that find the performance of the new o3 matches the previous one: https://twitter.com/arcprize/status/1932836756791177316

sethkim•7mo ago
Both great points, but more or less speak to the same root cause - customer usage patterns are becoming more of a driver for pricing than underlying technology improvements. If so, we likely have hit a "soft" floor for now on pricing. Do you not see it this way?
simonw•7mo ago
Even given how much prices have decreased over the past 3 years I think there's still room for them to keep going down. I expect there remain a whole lot of optimizations that have not yet been discovered, in both software and hardware.

That 80% drop in o3 was only a few weeks ago!

sethkim•7mo ago
No doubt prices will continue to drop! We just don't think it will be anything like the orders-of-magnitude YoY improvements we're used to seeing. Consequently, developers shouldn't expect the cost of building and scaling AI applications to be anything close to "free" in the near future as many suspect.
vfvthunter•7mo ago
I do not see it this way. Google is a publicly traded company responsible for creating value for their shareholders. When they became dicks about ad blockers on youtube last year or so, was it because they hit a bandwidth Moore's law? No. It was a money grab.

ChatGPT is simply what Google should've been 5-7 years ago, but Google was more interested in presenting me with ads to click on instead of helping me find what I was looking for. ChatGPT is at least 50% of my searches now. And they're losing revenue because of that.

mathiaspoint•7mo ago
I really hate the thinking. I do my best to disable it but don't always remember. So often it just gets into a loop, second-guessing itself until it hits the token limit. It's rare that it figures anything out while thinking, either, but maybe that's because I'm better at writing prompts.
thomashop•7mo ago
I have the impression that the thinking helps even if the actual content of the thinking output is nonsense. It awards more cycles to the model to think about the problem.
wat10000•7mo ago
That would be strange. There's no hidden memory or data channel, the "thinking" output is all the model receives afterwards. If it's all nonsense, then nonsense is all it gets. I wouldn't be completely surprised if a context with a bunch of apparent nonsense still helps somehow, LLMs are weird, but it would be odd.
mathiaspoint•7mo ago
Eh. The embeddings themselves could act like hidden layer activations and encode some useful information.
yorwba•7mo ago
Attention operates entirely on hidden memory, in the sense that it usually isn't exposed to the end user. An attention head on one thinking token can attend to one thing and the same attention head on the next thinking token can attend to something entirely different, and the next layer can combine the two values, maybe on the second thinking token, maybe much later. So even nonsense filler can create space for intermediate computation to happen.
barrkel•7mo ago
This isn't quite right. Even when an LLM generates meaningless tokens, its internal state continues to evolve. Each new token triggers a fresh pass through the network, with attention over the KV cache, allowing the model to refine its contextual representation. The specific tokens may be gibberish, but the underlying computation can still reflect ongoing "thinking".
Wowfunhappy•7mo ago
Wasn't there some study that just telling the LLM to write a bunch of periods first improves responses?
krackers•7mo ago
There are several such papers, off the top of my head one is https://arxiv.org/abs/2404.15758

It's a bit more subtle though, if I understand correctly this only works for parallelizable problems. Which makes intuitive sense since the model cannot pass information along with each dot. So in that sense COT can be seen as some form of sampling, which also tracks with findings that COT doesn't boost the "raw intelligence" but rather uncovers latent intelligence, converting pass@k to maj@k. Antirez touches upon this in [1].

On the other hand, I think problems with serial dependencies require "real" COT since the model needs to track the results of subproblems. There's also some studies which show a meta-structure to the COT itself though, e.g. if you look at DeepSeek there are clear patterns of backtracking and such that are slightly more advanced than naive repeated samplings. https://arxiv.org/abs/2506.19143

[1] https://news.ycombinator.com/item?id=44288049

krackers•6mo ago
Although thinking a bit more, even constrained to only output dots, there can still be some amount of information passing between each token, namely in the hidden states. The attention block N layers deep will compute attention scores off of the residual stream for previous inputs at that layer, so some information can be passed along this way.

It's not very efficient though, because for token i layer N can only receive as input layer N-1 for tokens i-1, i-2... So information is sort of passed along diagonally. If handwavily the embedding represents some "partial result" then it can be passed along diagonally from (N-1, i-1) to (N, i) to have the COT for token i+1 continue to work on it. So this way even though the total circuit depth is still bounded by # of layers, it's clearly "more powerful" than just naively going from layer 1...n, because during the other steps you can maybe work on something else.

But it's still not as powerful as allowing the results at layer n to be fed back in, which effectively unrolls the depth. This maybe intuitively justifies the results in the paper (I think it also has some connection to communication complexity).

jgalt212•7mo ago
I hate thinking mode because I prefer a mostly right answer right now over having to wait for a probably better, but still not exactly right answer.
bigbuppo•7mo ago
It's almost like there's an incentive for them to burn as many tokens as possible accomplishing nothing useful.
sharkjacobs•7mo ago
> This is the first time a major provider has backtracked on the price of an established model

Arguably that was Haiku 3.5 in October 2024.

I think the same hypothesis could apply though, that you price your model expecting a certain average input size, and then adjust price up to accommodate the reality that people use that cheapest model when they want to throw as much as they can into the context.

simonw•7mo ago
Haiku 3.5 was a completely different model from Haiku 3, and part of a new model generation.

Gemini Flash 2.5 and Gemini 2.5 Flash Preview were presumably a whole lot more similar to each other.

mossTechnician•7mo ago
Is there an expectation Haiku 3.5 is completely different? Even leaving semantic versioning aside, even if the .5 symbolizes a "halfway point" between major releases, it still suggests a non-major release to me.
simonw•7mo ago
Consumers have no idea what Haiku is.

Engineers who work with LLM APIs are hopefully paying enough attention that they understand the difference between Claude 3, Claude 3.5 and Claude 4.

ryao•7mo ago
I had the same thought about Haiku 3.5. They claimed it was due to the model being more capable, which basically means that they raised the price because they could.

Then there is Poe with its pricing games. Prices at Poe have been going up over time: they started out extremely aggressive to gain market share, presumably under the assumption that LLM prices would keep falling, and that reduced pricing did not materialize.

guluarte•7mo ago
they are doing the WeWork approach: gain customers at all costs, even if that means losing money.
FirmwareBurner•7mo ago
Aren't all LLMs losing money at this point?
simonw•7mo ago
I don't believe that's true on inference - I think most if not all of the major providers are selling inference at a (likely very small) margin over what it costs to serve them (hardware + energy).

They likely lose money when you take into account the capital cost of training the model itself, but that cost is at least fixed: once it's trained you can serve traffic from it for as long as you choose to keep the model running in production.

bungalowmunch•7mo ago
yes I would generally agree; although I don't have a source for this, I've heard whispers of Anthropic running at a much higher margin compared to the other labs
guluarte•7mo ago
Some companies like Google, Facebook, Microsoft, and OpenAI are definitely losing money providing free inference to millions of users daily. Companies where most users are using their API, like Anthropic, are probably seeing good margins since most of their users are paying users.
throwawayoldie•7mo ago
Yes, and the obvious endgame is wait until most software development is effectively outsourced to them, then jack the prices to whatever they want. The Uber model.
FirmwareBurner•7mo ago
Good thing AI can't replace my drinking during work time skills
ethanpailes•7mo ago
TPUs do give Google a unique structural advantage on inference cost though.
incomingpain•7mo ago
The big thing that really surprised me:

Llama 4 Maverick is 16x 17B, so about 67 GB in size. The equivalency is 400 billion.

Llama 4 Behemoth is 128x 17B, about 245 GB in size. The equivalency is 2 trillion.

I don't have the resources to be able to test these, unfortunately, but they are claiming Behemoth is superior to the best SaaS options via internal benchmarking.

Comparatively, DeepSeek R1 671B is 404 GB in size, with pretty similar benchmarks.

But compare DeepSeek R1 32B to any model from 2021 and it's going to be significantly superior.

So we have quality of models increasing, resources needed decreasing. In 5-10 years, do we have an LLM that loads up on a 16-32GB video card that is simply capable of doing it all?

sethkim•7mo ago
My two cents here is the classic answer - it depends. If you need general "reasoning" capabilities, I see this being a strong possibility. If you need specific, factual information baked into the weights themselves, you'll need something large enough to store that data.

I think the best of both worlds is a sufficiently capable reasoning model with access to external tools and data that can perform CPU-based lookups for information that it doesn't possess.

ezekiel68•7mo ago
"How is our 'Strategic Use of LLM Technology' initiative going, Harris?"

"Sir, I'm delighted to report that the productivity and insights gained outclass anything available from four years ago. We are clearly winning."

sharkjacobs•7mo ago
> If you’re building batch tasks with LLMs and are looking to navigate this new cost landscape, feel free to reach out to see how Sutro can help.

I don't have any reason to doubt the reasoning this article is doing or the conclusions it reaches, but it's important to recognize that this article is part of a sales pitch.

sethkim•7mo ago
Yes, we're a startup! And LLM inference is a major component of what we do - more importantly, we're working on making these models accessible as analytical processing tools, so we have a strong focus on making them cost-effective at scale.
sharkjacobs•7mo ago
I see your prices page lists the average cost per million tokens. Is that because you are using the formula you describe, which depends on hardware time and throughput?

> API Price ≈ (Hourly Hardware Cost / Throughput in Tokens per Hour) + Margin
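
Plugging illustrative numbers into that formula, just to show the shape of the calculation (every value here is hypothetical):

    # API Price ≈ (Hourly Hardware Cost / Throughput in Tokens per Hour) + Margin
    hourly_hardware_cost = 2.50      # $/accelerator-hour (hypothetical)
    tokens_per_second = 5_000        # aggregate throughput (hypothetical)
    margin = 0.05                    # $ per million tokens (hypothetical)

    tokens_per_hour = tokens_per_second * 3600
    price_per_million = hourly_hardware_cost / tokens_per_hour * 1e6 + margin
    print(f"~${price_per_million:.2f} per million tokens")   # ~$0.19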

samtheprogram•7mo ago
There’s absolutely nothing wrong with putting a small plug at the end of an article.
sharkjacobs•7mo ago
Of course not.

But the thrust of the article is that contrary to conventional wisdom, we shouldn't expect LLM models to continue getting more efficient, and so it's worthwhile to explore other options for cost savings in inference, such as batch processing.

The conclusion they reach is one which directly serves what they're selling.

I'll repeat; I'm not disputing anything in this article. I'm really not, I'm not even trying to be coy and make allusions without directly saying anything. If I thought this was bullshit I'm not afraid to semi-anonymously post a comment saying so.

But this is advertising, just like Backblaze's hard drive reliability blog posts are advertising.

jasonthorsness•7mo ago
Unfounded extrapolation from a minor pricing update. I am sure every generation of chips also came with “end of Moore’s law” articles for the actual Moore’s law.

FWIW Gemini 2.5 Flash Lite is still very good; I used it in my latest side project to generate entire web sites and it outputs great content and markup every single time.

ramesh31•7mo ago
>By embracing batch processing and leveraging the power of cost-effective open-source models, you can sidestep the price floor and continue to scale your AI initiatives in ways that are no longer feasible with traditional APIs.

Context size is the real killer when you look at running open source alternatives on your own hardware. Has anything even come close to the 100k+ range yet?

sethkim•7mo ago
Yes! Both Llama 3 and Gemma 3 have 128k context windows.
ryao•7mo ago
Llama 3 had a 8192 token context window. Llama 3.1 increased it to 131072.
ryao•7mo ago
Mistral Small 3.2 has a 131072 token context window.
georgeburdell•7mo ago
Is there math backing up the “quadratic” statement with LLM input size? At least in the traffic analogy, I imagine it’s exponential, but for small amounts exceeding some critical threshold, a quadratic term is sufficient
gpm•7mo ago
Every token has to calculate attention for every previous token, that is, attention takes O(sum_{i=0}^{n-1} i) work, and sum_{i=0}^{n-1} i = n(n-1)/2, so that first expression is equivalent to O(n^2).

I'm not sure where you're getting an exponential from.
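
Written out for a sequence of n tokens, where token i attends to the i tokens before it:

    \sum_{i=0}^{n-1} i = \frac{n(n-1)}{2} = O(n^2)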

timewizard•7mo ago
I've only ever seen linear increases. When did Moore's law even _start_?
recursive•7mo ago
1970

https://en.wikipedia.org/wiki/Moore%27s_law#/media/File:Moor...

jjani•7mo ago
> In a move that at first went unnoticed

Stopped reading here. If you're positioning yourself as if you have some kind of unique insight when there is none, in order to boost your credentials and sell your product, there's little chance you have anything actually insightful to offer. Might sound like an overreaction/nitpicking, but it's entirely needless LinkedIn-style "thought leader" nonsense.

In reality it was immediately noticed by anyone using these models, have a look at the HN threads at the time, or even on Reddit, let alone the actual spaces dedicated to AI builders.

fusionadvocate•7mo ago
What is holding back AI is this business necessity that models must perform everything. Nobody can push for a smaller model that learns a few simple tasks and then builds upon that, similar to the best known intelligent machine: the human.

If these corporations had to build a car they would make the largest possible engine, because "MORE ENGINE MORE SPEED", just like they think that bigger models mean bigger intelligence, but forget to add steering, or even a chassis.

dehugger•7mo ago
I agree. I want to be able to get smaller models which are complete, contained, products which we can run on-prem for our organization.

I'll take a model specialized in web scraping. Give me one trained on generating report and documentation templates (I'd commit felonies for one which could spit out a near-complete report for SSRS).

Models trained for specific helpdesk tasks ("install a printer", "grant this user access to these services with this permission level").

A model for analyzing network traffic and identifying specific patterns.

None of these things should require titanic models nearing trillions of parameters.

cruffle_duffle•7mo ago
That’s just machine learning though!
furyofantares•7mo ago
This is extremely theorycrafted but I see this as an excellent thing driving AI forward, not holding it back.

I suspect a large part of the reason we've had many decades of exponential improvements in compute is the general purpose nature of computers. It's a narrow set of technologies that are universally applicable and each time they get better/cheaper they find more demand, so we've put an exponentially increasing amount of economical force behind it to match. There needed to be "plenty of room at the bottom" in terms of physics and plenty of room at the top in terms of software eating the world, but if we'd built special purpose hardware for each application I don't think we'd have seen such incredible sustained growth.

I see neural networks and even LLMs as being potentially similar. They're general purpose, a small set of technologies that are broadly applicable and, as long as we can keep making them better/faster/cheaper, they will find more demand, and so benefit from concentrated economic investment.

fnord123•7mo ago
They aren't arguing against LLMs. They are arguing against their toaster's LLM, which only needs to make the perfect toast, being trained on the tax policies of the Chang Dynasty.
furyofantares•7mo ago
I'm aware! And I'm personally excited about small models but my intuition is that maybe pouring more and more money into giant general purpose models will have payoff as long as it keeps working at producing better general purpose results (which maybe it won't).
int_19h•7mo ago
Thing is, we keep finding out again and again that having a very broad training mix in the baseline model makes it better across the board, including in those specialized tasks when you fine-tune it.

As I understand it, the general ability to reason is what the models get out of "being trained on the tax policies of the Chang Dynasty", and we haven't really figured out a better way to do so than to throw most everything at them. And even if all you do is make toast, you still need some intelligence.

fnord123•6mo ago
> And even if all you do is make toast, you still need some intelligence.

No you don't. That was the point of the example.

flakiness•7mo ago
It can be just Google trying to capitalize on Gemini's increasing popularity. Until 2.5, Gemini was a total underdog. Less so since 2.5.
ezekiel68•7mo ago
There's another side to that coin: supply.

Since Gemini CLI was recently released, many people on the "free" tier noticed that their sessions immediately devolved from Gemini 2.5 Pro to Flash "due to high utilization". I asked Gemini itself about this and it reported that the finite GPU/TPU resources in Google's cloud infrastructure can get oversubscribed for Pro usage. Google (no secret here) has a subscription option for higher-tier customers to request guaranteed provisioning for the Pro model. Once their capacity gets approached, they must throttle down the lower-tier (including free) sessions to the less resource-intensive models.

Price is one lever to pull once capacity becomes constrained. Yet, as the top-voted comment of this post explains, it's not honest to simply label this as a price increase. They raised Flash pricing on input tokens but lowered pricing on output tokens up to certain limits -- which gives credence to the theory that they are trying to shape demand so that it better matches their capacity.

apstroll•7mo ago
Extremely doubtful that it boils down to quadratic scaling of attention. That whole issue is a leftover from the days of small BERT models with very few parameters.

For large models, compute is very rarely dominated by attention. Take, for example, this FLOPs calculation from https://www.adamcasson.com/posts/transformer-flops

Compute per token = 2(P + L × W × D)

where P = total parameters, L = number of layers, W = context (window) size, D = embedding dimension.

For Llama 8b, the window size starts dominating compute cost per token only at 61k tokens.
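
That crossover can be checked directly from the formula above, using approximate Llama 3 8B figures (the exact numbers depend on the configuration):

    # Per-token compute ≈ 2 * (P + L * W * D); the attention term starts to
    # dominate once L * W * D exceeds P, i.e. at W > P / (L * D).
    P = 8.0e9   # total parameters (approx., 8B-class model)
    L = 32      # layers
    D = 4096    # embedding dimension

    crossover = P / (L * D)
    print(f"Attention dominates beyond ~{crossover:,.0f} tokens of context")   # ~61,000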

fathermarz•7mo ago
Google is raising prices for most of their services. I do not agree that this is due to the cost of compute or that this is the end of Moore’s Law. I don’t think we have scratched the surface.
checker659•7mo ago
> cost of compute

DRAM scaling + interconnect bandwidth stagnation

llm_nerd•7mo ago
Basing anything on Google's pricing is folly. Quite recently Google offered several of their preview models at a price of $0.00.

Because they were the underdog. Everyone was talking about ChatGPT, or maybe Anthropic. Then Deepseek. Google were the afterthought that was renowned for that ridiculous image generator that envisioned 17th century European scientists as full-headdress North American natives.

There has been an absolute 180 since then, and Google now has the ability to set their pricing similar to the others. Indeed, Google's pricing still has a pretty large discount over similarly capable model levels, even after they raised prices.

The warning is that there is no free lunch, and when someone is basically subsidizing usage to get noticed, they don't have to do that once their offering is good.

mpalmer•7mo ago
Is this overthinking it? Google had a huge incentive to outprice Anthropic and OAI to join the "conversation". I was certainly attracted to the low price initially, but I'm staying because it's still affordable and I still think the Gemini 2.5 options are the best simple mix of models available.
refulgentis•7mo ago
This is a marketing blog[^1], written with AI[^2], heavily sensationalized, & doesn't understand much in the first place.

We don't have accurate price signals externally because Google, in particular, has been very aggressive at treating pricing as a competitive exercise rather than anything that seemed tethered to costs.

For quite some time, their pricing updates would be across-the-board exactly 2/3 of the cost of OpenAI's equivalent mode.

[^1] "If you’re building batch tasks with LLMs and are looking to navigate this new cost landscape, feel free to reach out to see how Sutro can help."

[^2] "Google's decision to raise the price of Gemini 2.5 Flash wasn't just a business decision; it was a signal to the entire market." by far the biggest giveaway, the other tells are repeated fanciful descriptions of things that could be real, that when stacked up, indicate a surreal, artifical, understanding of what they're being asked to write about, i.e. "In a move that at first went unnoticed,"

YetAnotherNick•7mo ago
Pricing != Cost.

One of the clearest examples is DeepSeek V3. DeepSeek has mentioned that its price of $0.27/$1.10 carries an 80% profit margin, so it costs them roughly 90% less than the price of Gemini Flash. And Gemini Flash is very likely a smaller model than DeepSeek V3.

impure•7mo ago
> In a move that at first went unnoticed

Oh, I noticed. I've also complained how Gemini 2.0 Flash is 50% more expensive than Gemini 1.5 Flash for small requests.

Also I'm sure if Google wanted to price Gemini 2.5 Flash cheaper they could. The reason they won't is because there is almost zero competition at the <10 cents per million input token area. Google's answer to the 10 cents per million input token area is 2.5 Flash Lite which they say is equivalent to 2.0 Flash at the same cost. Might be a bit cheaper if you factor in automatic context caching.

Also the quadratic increase is valid, but it's not as simple as the article states due to caching. And if it were a big issue, Google would impose tiered pricing like they do for Gemini 2.5 Pro.

And for what it's worth I've been playing around with Gemma E4B on together.ai. It takes 10x as long as Gemini 2.5 Flash Lite and it sucks at multilingual. But other than that it seems to produce acceptable results and is way cheaper.

antirez•7mo ago
I think providers are making a mistake in simplifying prices at all costs, hiding the quadratic nature of attention. People can understand the pricing anyway, even if it's more complex, by having a tool that lets them select a prompt and a reply length and see the cost, or fancy 3D graphs that capture the cost surface of different cases. People would start sending smaller prompts and less context when less is enough, and what they pay would be more closely related to the amount of GPU/TPU/... power they use.
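
A minimal sketch of that kind of tool, with an invented cost model (a linear per-token price plus a quadratic attention surcharge; every coefficient is hypothetical):

    def estimated_cost(prompt_tokens, reply_tokens,
                       price_in=0.30e-6, price_out=2.50e-6, attn_coeff=1e-13):
        # Toy cost surface: linear token pricing plus a quadratic term standing in
        # for attention compute. All coefficients are made up for illustration.
        n = prompt_tokens + reply_tokens
        return prompt_tokens * price_in + reply_tokens * price_out + attn_coeff * n * n

    for prompt, reply in [(1_000, 500), (50_000, 500), (500_000, 500)]:
        print(f"{prompt:>7,} prompt + {reply} reply tokens -> ${estimated_cost(prompt, reply):.4f}")

With something like this in front of users, the quadratic term only becomes visible for very long prompts, which is exactly the signal linear pricing hides.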
Havoc•7mo ago
I don’t think it’s right to right a technical floor into this.

It could just as well have been Google reducing subsidisation. From the outside that would look exactly the same.

notphilipmoran•7mo ago
I feel that the details regarding the type of model and the purpose it serves are underrepresented here. Yes, existing models will get cheaper over time as they become more obsolete, but costs at the forefront of innovation will only increase due to these mentioned bottlenecks. There is also the basic law of supply and demand coming into play: as models get more advanced, more industries will be exposed to them and see the potential cost savings compared to the current alternative. This will further increase demand, and with further innovation there will be further capability, in turn again increasing demand. I only see this reversing if you are not at the forefront of innovation, and many people using these LLMs at this point are close to it, at least compared to many "normal" people and their understanding of LLMs.
lemming•7mo ago
Can anyone explain the economics of Anthropic's Max plan pricing to me? I have friends on the $100/month plan using well over $800 of tokens per month with Claude Code (according to ccusage). I certainly don't use Claude Code as much if I'm not on a flat-rate plan; the cost spirals out of control very quickly. I understand that a subscription makes for more predictable revenue and that there will be people on the Max plan not using Claude Code 24/7, but the delta between what the API costs and what using the Max plan with Claude Code costs just seems too great for that to be an explanation. I don't think that user/mindshare capture can fully explain it either, Code is free and the cost of switching to something else if pricing later changes is just too low. I don't get it.
ido•7mo ago
We’re in an LLM bubble and their money is cheap as they’re drowning in investor money and have to spend it + show growth. If it doesn’t make economic sense you probably can’t count on it to last once the bubble bursts.
x-complexity•7mo ago
The article assumes that there will be no architectural improvements / migrations in the future, & that Sparse MoE will always stay. Not a great foundation to build upon.

Personally, I'm rooting for RWKV / Mamba2 to pull through, somehow. There's been some work done to increase their reasoning depths, but transformers still beat them without much effort.

https://x.com/ZeyuanAllenZhu/status/1918684269251371164

NetRunnerSu•7mo ago
In fact, what you need is a dynamically sparse, highly fragmented Transformer MoE, rather than an RNN-like architecture that is destined to fall behind...

In neurobiological terms, the Transformer architecture is more in line with the highly interconnected global receptive field of neurons.

https://github.com/dmf-archive/PILF

gchamonlive•7mo ago
> This is the first time a major provider has backtracked on the price of an established model. While it may seem like a simple adjustment, we believe this signals a turning point. The industry is no longer on an endless downward slide of cost. Instead, we’ve hit a fundamental soft floor on the cost of intelligence, given the current state of hardware and software.

That is assuming pricing and price drops only occurred because of cost reductions caused by technical advancements. While that certainly played a role, it disregards the role investment money plays.

Maybe we've hit a wall in the "Moore's law for AI", or maybe it's just harder to justify these massive investments while all you have to show for them are marginal improvements in the eyes of these investors, who are becoming increasingly anxious to get their money back.