frontpage.

Made with ♥ by @iamnishanth

Open Source @Github


Hardware Attestation as Monopoly Enabler

https://grapheneos.social/@GrapheneOS/116550899908879585
455•ChuckMcM•2h ago•145 comments

Incident Report: CVE-2024-YIKES

https://nesbitt.io/2026/02/03/incident-report-cve-2024-yikes.html
216•miniBill•3h ago•53 comments

Local AI needs to be the norm

https://unix.foo/posts/local-ai-needs-to-be-norm/
115•cylo•3h ago•60 comments

Traces Of Humanity

https://tracesofhumanity.org/hello-world/
83•alex77456•3h ago•11 comments

Ask HN: What are you working on? (May 2026)

49•david927•3h ago•182 comments

Lakebase architecture delivers faster Postgres writes

https://www.databricks.com/blog/how-lakebase-architecture-delivers-5x-faster-postgres-writes
70•sp_from_db•2d ago•19 comments

I returned to AWS and was reminded why I left

http://fourlightyears.blogspot.com/2026/05/i-returned-to-aws-and-was-reminded-hard.html
547•andrewstuart•1d ago•417 comments

Stop MitM on the first SSH connection, on any VPS or cloud provider

https://www.joachimschipper.nl/Stop%20MITM%20on%20the%20first%20SSH%20connection,%20on%20any%20VP...
31•JoachimSchipper•2d ago•17 comments

Louis Rossmann offers to pay legal fees for a threatened OrcaSlicer developer

https://www.tomshardware.com/3d-printing/louis-rossmann-tells-3d-printer-maker-bambu-lab-to-go-bl...
348•iancmceachern•6h ago•190 comments

What's a mathematician to do? (2010)

https://mathoverflow.net/questions/43690/whats-a-mathematician-to-do
122•ipnon•9h ago•64 comments

YC's Biggest Scandals

https://ycombinator.fyi/
99•laserduck•4h ago•32 comments

Idempotency is easy until the second request is different

https://blog.dochia.dev/blog/idempotency/
240•ludovicianul•3d ago•159 comments

Show HN: An index of indie web/blog indexes

https://theindex.fyi
58•rocketpastsix•8h ago•18 comments

Eight More 8-bit Era Microprocessors (2024)

https://thechipletter.substack.com/p/eight-more-8-bit-era-microprocessors
19•klelatti•2d ago•3 comments

Space Cadet Pinball on Linux

https://brennan.io/2026/05/09/pinball-and-escrow/
282•jandeboevrie•9h ago•96 comments

Shunting-Yard Animation

https://somethingorotherwhatever.com/shunting-yard-animation/
39•s1291•5h ago•12 comments

The One Dollar Counterfeiter

https://www.amusingplanet.com/2026/05/emerich-juettner-one-dollar.html
303•cainxinth•3d ago•126 comments

Show HN: Building a web server in assembly to give my life (a lack of) meaning

https://github.com/imtomt/ymawky
371•imtomt•17h ago•198 comments

Think Linear Algebra (2023)

https://allendowney.github.io/ThinkLinearAlgebra/index.html
129•tamnd•11h ago•14 comments

Spain has become one of Europe’s cheapest power markets

https://janrosenow.substack.com/p/spain-just-became-one-of-europes
96•marc__1•4h ago•72 comments

Task Paralysis and AI

https://g5t.de/articles/20260510-task-paralysis-and-ai/index.html
151•MrGilbert•14h ago•90 comments

Walking slower? Your ears, not your knees, might be the problem

https://www.wsj.com/health/wellness/hearing-loss-walking-speed-iphone-study-c53c482a
60•marc__1•1d ago•50 comments

Casio S100X Japanese Lacquer Edition (JP Page Only)

https://www.casio.com/jp/basic-calculators/premium/en-s100x-jc1-u/
285•dr_kiszonka•3d ago•138 comments

9 Mothers (YC P26) Is Hiring

https://jobs.ashbyhq.com/9-mothers?utm_source=x8pZ4B3P3Q
1•ukd1•8h ago

The River Otter's Remarkable Comeback

https://www.rewildingmag.com/the-river-otters-remarkable-comeback/
73•surprisetalk•3d ago•16 comments

GitHub is sinking

https://dbushell.com/2026/04/29/github-is-sinking/
177•herbertl•4h ago•108 comments

I’ve banned query strings

https://chrismorgan.info/no-query-strings
531•susam•1d ago•276 comments

The locals don't know

https://www.quarter--mile.com/The-Locals-Dont-Know
59•herbertl•4h ago•45 comments

We see something that works, and then we understand it

https://lemire.me/blog/2025/12/04/we-see-something-that-works-and-then-we-understand-it/
184•surprisetalk•4d ago•75 comments

Chrome's AI features may be hogging 4GB of your computer storage

https://www.theverge.com/tech/924933/google-chrome-4gb-gemini-nano-ai-features
74•birdculture•5h ago•36 comments

Local AI needs to be the norm

https://unix.foo/posts/local-ai-needs-to-be-norm/
112•cylo•3h ago

Comments

sgt•3h ago
I guess Google got that memo!
williamtrask•1h ago
I wonder if a popularization moment for local AI will ultimately be the pin-prick that pops the AI bubble. Like the deepseek or openclaw moments but bigger/next.
gdulli•13m ago
That's like wondering if enough people discovering local media streaming will disrupt commercial streaming services. It's not going to happen. Most people are not ambitious and will let themselves be controlled by the services of least resistance.

And you can't take comfort in knowing that you, personally, will remain in control of your own computing. The majority will let the range and direction of their thoughts and output be determined by the will of the tech giant whose AI they adopt. And that will shape society.

Galanwe•58m ago
I would love for local inference to be possible, but in my experience, Kimi 2.6 is the only model that would be worth it, and that's a $10k (max-spec'd M3 Ultra, with ~30s TTFT, so kind of slow) to $30k (RTX 6000 / 700GB+ DDR5) upfront, noise and power consumption aside.
mft_•54m ago
You're maybe missing the article's point, which is to use local models appropriately:

> “But Local Models Aren’t As Smart”

> Correct.

> But also so what?

> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.

> And for those tasks, local models can be truly excellent.

Galanwe•40m ago
This is a bit naive IMHO...

I have tried quite a bunch of local models, and the reality is that it's not just a matter of "it's a small model that should be hostable easily". It's also a matter of what your acceptable prefill TTFT and decode t/s are.

All the local models I used on a _consumer grade_ server (32GB DDR5, AMD Ryzen) have been mostly unusable interactively (no decent use as a coding agent possible), and even for things like classification, context size is immediately an issue.

I say that with 6 months' experience running various local models for classifying and summarizing my RSS feeds. Just offline summarizing and tagging of the HN articles published on the front page barely keeps the queue sustainable and not growing continuously.
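A back-of-envelope sketch of why those two numbers dominate: for a single request, time-to-first-token is roughly prompt length over prefill throughput, and the rest is output length over decode throughput. The figures below are illustrative assumptions, not benchmarks of any particular model or machine:

```python
# Rough latency model for a single local-LLM request.
# prefill_tps / decode_tps are assumed throughputs, not measurements.

def request_latency(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (ttft_seconds, total_seconds) for one request."""
    ttft = prompt_tokens / prefill_tps          # time to first token
    total = ttft + output_tokens / decode_tps   # plus generation time
    return ttft, total

# Classifying one article: a 4k-token prompt and a 50-token label,
# on a CPU-bound box doing ~200 t/s prefill and ~8 t/s decode.
ttft, total = request_latency(4000, 50, 200.0, 8.0)
print(f"TTFT: {ttft:.1f}s, total: {total:.2f}s")  # TTFT: 20.0s, total: 26.25s
```

At those (pessimistic but CPU-plausible) rates, even a short classification reply costs tens of seconds per article, which is how a front-page queue can grow faster than it drains.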

mikrl•39m ago
One of my hobbyist workflows involved transcribing ETF prospectuses into YAML for an optimizer to optimize over.

Used to take me maybe 10-20 minutes per sheet.

Then I got codex to whip up a script that sends each sheet to a fairly low-parameter, locally running LLM, and I have the YAML in a couple of seconds.

My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…
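mikrl doesn't share the script, but a minimal sketch of that kind of pipeline, assuming a local OpenAI-compatible server (llama-server, ollama, etc.) on an assumed port, could look like this:

```python
import json
import urllib.request

# Hypothetical local OpenAI-compatible endpoint; the URL, model name,
# and prompt are assumptions for illustration.
LOCAL_URL = "http://localhost:8080/v1/chat/completions"

def build_extraction_request(sheet_text: str, model: str = "local-model") -> dict:
    """Build a chat-completions payload asking the model for YAML output."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Extract holdings, fees, and tickers from the "
                        "prospectus below. Reply with YAML only."},
            {"role": "user", "content": sheet_text},
        ],
        "temperature": 0,  # deterministic-ish extraction
    }

def extract_yaml(sheet_text: str) -> str:
    """Send one sheet to the local model and return its YAML reply."""
    req = urllib.request.Request(
        LOCAL_URL,
        data=json.dumps(build_extraction_request(sheet_text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the output is short and structured, even a small local model at modest decode speed turns each sheet around in seconds, which matches the speedup described above.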

jjordan•51m ago
It feels like we're one technological breakthrough away from all of these data centers going up being deemed irrelevant.
i_love_retros•49m ago
What would that breakthrough be?
Waterluvian•48m ago
Magic math and computer science that allows us to get the same quality response for a fraction of the GPU.
toufka•27m ago
I mean, the most cutting-edge iPhones, iPads and MacBook Pros _today_ are quite capable of running today's high-end local LLMs in real time.

If you project out that hardware just a couple of years, and the trained models out a couple of years, you end up in a place where it makes so much more sense to run them locally, for all sorts of latency, privacy, efficacy, and domain-specific reasons.

Not all that different from the old terminal & mainframe->pc shifts.

Finally - hardware has seemingly gotten out ahead of software that most folks use - watching YouTube, listening to music, playing a game or two. There was a time when playing an mp3 or watching a 4k video really taxed all but the nicest systems. Hardware fixed that problem, like it very well could this one.

intothemild•26m ago
That's already happening. Qwen3.6 and Gemma4.

Basically small and medium models that are crazy well trained for their sizes.

Then we have a lot of speculative decoding stuff like MTP and others coming to speed up responses, and finally better quantisation to use less memory.

Local LLMs are the future, and the larger labs know that the open models will eat their lunch once people realise that the gap is only a few months. If frontier models were good enough for us a couple of months ago, the open models are good enough for us now.

YZF•19m ago
The current LLMs are also "magic" so anything is possible. AFAIK there is no proof that the current architecture is optimal. And we have our brains as a pretty powerful local thinking machine as a counter-example to the idea that thinking has to happen in data centers.
_heimdall•4m ago
I'd assume it's a totally different architecture, one that isn't based on storing a compressed dataset of all digital human text.
Lalabadie•35m ago
The cynical take is getting more and more to be the only rational one:

The promised mega-data center deals are meant to boost valuations today, not serve tons of customers three years from now.

jjordan•17m ago
oof, this bubble popping is gonna be brutal.
_heimdall•5m ago
It seems pretty clearly in line with the dotcom bubble to me. Every company claims to be a leading AI company, those building infrastructure are promising the moon and getting 1/3 of the way there, and no one knows how to monetize it to justify the hype or expense.
timeattack•45m ago
My problem with LLMs (apart from the philosophical aspects and economic impact) is that it would be unlikely for any of us to be able to train something functional locally (toy-like LLMs, sure, but something really useful, no). Apart from requiring immense computing power, it also requires a dataset which is for the most part obtained illegally.
cyanydeez•43m ago
That sounds like government. So your problem is mostly that you expect to have a collective social effort, but not enough to pay for it as a public good.
Ucalegon•35m ago
Depends on the domain. There are plenty of different use cases where the data needed for training is available for personal, or non-commercial, use. At that point, it does come down to compute/time to do the training, which if you are willing to wait, consumer grade hardware is perfectly capable of developing useful models.
kibwen•30m ago
This seems overly pessimistic.

I may personally be of modest intelligence, but to acquire the intelligence that I do have, I did not need to train on every book ever written, every Wikipedia article ever written, every blog post ever written, every reference manual ever written, every line of code ever written, and so on. In fact, I didn't train on even 1% of those materials, or even 0.00000000001% of those. The texts themselves were demonstrably not a prerequisite for intelligence.

At minimum, given that it only took me about 20 years of casual observation of my surroundings to approximate intelligence, this is proof positive that the only "dataset" you need is a bunch of sensors and the world around you.

And yes, of course, the human brain does not start from zero; it had a few million years of evolution to produce a fertile plot for intelligence to take root. But that fundamental architecture is fairly generic, and does not at all seem predicated on any sort of specific training set. You could feasibly evolve it artificially.

_heimdall•21m ago
You're also embodied and experiencing the world around you with more senses than only the ability to read text.
dlcarrier•5m ago
Not the whole thing, at least with current technology, but LoRAs are really good at fine-tuning, and can be generated in a few hours on high-end gaming computers, so as long as the base model is in your language, you likely have enough spare computing power, in whatever electronics you own, to train a few LoRAs a month.

In the future, when regular home computers have the capabilities of modern servers, we'll be able to train the entire LLM at home.
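The reason LoRA training fits on a gaming PC is parameter math: instead of updating a full d×k weight matrix, you train two low-rank factors A (d×r) and B (r×k), so the trainable count per matrix drops from d·k to r·(d+k). A quick sketch (the layer shape and rank are example values, not tied to any specific model):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Trainable parameters for full fine-tuning vs a rank-r LoRA
    adapter on a single d×k weight matrix."""
    full = d * k          # every weight updated
    lora = r * (d + k)    # A is d×r, B is r×k
    return full, lora

# One 4096×4096 attention projection with a rank-16 adapter:
full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora/full:.3%}")
# full: 16,777,216  lora: 131,072  ratio: 0.781%
```

Training well under 1% of the weights per layer is what pulls the memory and compute budget down into consumer-hardware territory.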

revolvingthrow•44m ago
A local Answer Machine is the dream, especially when the internet is decaying and generally on its last legs, but the hardware requirements seem like a huge mountain to climb. Things are progressing tremendously - deepseek v4 flash is very good for what it is - but even that goes beyond any reasonable local setup, which imo is 128 GB ram + 16 GB vram. 4 ram slots on a consumer board craters ram speed, 256 gb macs are too expensive, and even then the inference is ungodly slow.

On the other hand… the v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we'll get similar performance in a ~120B model in a year, which is viable (if expensive) for everyman hardware. Possibly you'll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a scifi movie. A good harness that lets the model fetch data from other sources, like a local wikipedia copy from kiwix, could do a lot for factual knowledge, too; there's only so much you can encode in the model itself, but even a cheapish (pre-current-prices) 2TB drive can hold an immense amount of LLM-accessible data.

Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.

agentifysh•43m ago
Until the hardware is economical and powerful enough, local AI that can compete with frontier models today is still far off.

If we could even get something like GPT 5.5 running locally that would be quite useful.

vegabook•42m ago
>> years ago I launched "The Brutalist Report"

proceeds to brutalise the reader with an 88-point headline font.

hypfer•41m ago
Same as local compute.

Welcome back to 2014. Let us now continue yelling at the cloud.

artursapek•39m ago
I'm someone who is trying to build a subscription-based business to cover underlying LLM costs, and I'm very hopeful I can one day just sell a permanent license to the software instead, with customers using local LLMs to power it.
TheJCDenton•35m ago
For the mainstream audience, the sentiment around local AI today is the same as they had around open source a few decades ago. For a few products, some paid solutions were so much more advanced that open source was very often completely overlooked. Why bother? And the like. Then we had captive SaaS and other platforms, and now it's obviously wrong for most of us.

The dependency we have on Anthropic and OpenAI for coding, for instance, is insane. Most accept it because either they don't care, or they just hope the Chinese will never stop releasing open weights. The business model of open weights is very new, involves some power play between countries and labs, and moves an absurd amount of money without any concrete oversight from most people.

It's a very dangerous gamble. Today incredible value is available for nearly everyone. But it may stop without any warning, for reason outside our control.

oytis•30m ago
What is the business model of open weight AI? I don't think there is any. At best it can serve as an advertisement for the more advanced models you sell.

The huge difference to open source is that you can't just train an LLM with free time and motivation. You need lots of data and a lot of compute.

I sure want to be wrong on that, I definitely like the open-weight version of the future more

worldsayshi•25m ago
It should be feasible to crowdfund training runs, right?
dmd•6m ago
A training run costs somewhere in the neighborhood of a billion dollars. That’s a thousand millions.

How many crowdfunded projects do you know that have raised even one percent of that? Who’s going to be in charge of collecting that scale of money? Perhaps some sort of company formed for the benefit of humanity, which will promise to be a non-profit? Some sort of “Open” AI?

Oh, wait.

PAndreew•20m ago
Perhaps you can create a compelling UX around it and sell it as a subscription. "Normies" will not be able or willing to build it. You can then patch the model / ship new features around it as it evolves. For example, I have built an ambient todo list / health data extractor using Gemma 4 2EB and Whisper. Nothing to brag about, but it does a fairly decent job even in foreign languages.
karussell•17m ago
> What is the business model of open weight AI?

This is what I do not understand as well and advertising the knowledge and more advanced model is also the only thing that comes to my mind.

For the past month I have been successfully using gemma4 locally on an MBP M2 for many search queries (wikipedia-style questions), and it is really good, fast enough (30-40 t/s), and feels nice as it keeps these queries private. But I don't understand why Google does this, and so I think "we" need to find a better solution where the entire pipeline is open and the compute somehow crowdfunded. Because there will come a time when these local models get more closed, like Android is closing down. One restriction they might enforce in the future could be crippling the models for "sensitive" topics like cybersecurity or health. Or the government could even feel the need to force them to do so.

2ndorderthought•9m ago
Why would you want to try to support all users' simple queries on your AI data center if they could run them on their own computers?

It builds goodwill, and it also shows research prowess.

For China it's different. They need to show Americans, who don't trust them at all because of propaganda, that they have no tricks up their sleeve. It also doesn't hurt when Chinese companies drop models for free that people can run at home and that are about as good as Sonnet. Serious mic drop.

karussell•4m ago
Indeed cost can be another factor. Maybe also the main reason why Chrome added an offline model.
aabhay•12m ago
Disagree with this. When cost becomes an important factor, or the free-but-worse option becomes compelling and accessible (i.e. an on-device agent via Apple-style UX), there has been a significant shift in user behavior towards local. Think about stuff like removing backgrounds from photos, or OCR on PDFs: who uses paid services for casual usage of these things?
shmerl•35m ago
Depending on some remote AI provider is a major lock-in pitfall. But it's exactly what those AI providers want you to do.
cubefox•31m ago
Local AI is a bit like wind farms. Everyone is in favor, except when they are in your own backyard. There was recently a huge outcry when Chrome shipped a local 4 GB AI model: https://news.ycombinator.com/item?id=48019219

I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.

barrkel•31m ago
Local models are extraordinarily expensive if you're not maximizing throughput, and you're not going to be maximizing it.

Local models need to be resident in expensive RAM, the kind that has fat pipes to compute. And if you have a local app, how do you take a dependency on whatever random model is installed? Does it support your tool calling complexity? Does it have multimodal input? Does it support system messages in the middle of the conversation or not? Is it dumb enough to need reminders all the time?

Spend enough time building against local models and you'll see they're jagged in performance. You need to tune context size, trade off system message complexity with progressive disclosure. You simply can't rely on intelligence. A bunch of work goes into the harness.

Meanwhile, third party inference is getting the benefits of scale. You only need to rent a timeslice of memory and compute. It's consistent and everybody gets the same experience. And yes, it needs paying for, but the economics are just better.

bheadmaster•29m ago
> And if you have a local app, how do you take a dependency on whatever random model is installed?

Why not ship your own model? In the age of Electron apps, 10GB+ apps are not unheard of.

_heimdall•23m ago
Personally I wouldn't want a couple dozen apps installed all with their own model.

It seems easier to have industry specs that define a common interface for local models.

I also assume the OS can, or would need to, be involved in providing the models. That may not be a good thing depending on your views of OS vendors, but sharing a single local model does seem more like an OS concern.

alex7o•21m ago
I mean, the OpenAI API is the industry standard for allowing apps to communicate with models: llama-server has it, oMLX has it, ollama has it, vLLM has it, LM Studio as well. I don't think this is such a hard thing to do, but it requires people to set it up.
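To alex7o's point, the request shape is identical across those servers; only the base URL (shown here with each project's default port, as an assumption) and the model name change:

```python
import json
import urllib.request

# Interchangeable local OpenAI-compatible backends. Ports are each
# project's documented defaults; treat them as assumptions.
BACKENDS = {
    "llama-server": "http://localhost:8080/v1",
    "ollama":       "http://localhost:11434/v1",
    "vllm":         "http://localhost:8000/v1",
}

def chat(base_url: str, model: str, prompt: str) -> str:
    """Send one chat request to any OpenAI-compatible server."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# e.g. chat(BACKENDS["ollama"], "gemma3", "Summarize this headline: ...")
```

An app written against this one surface can swap inference backends without code changes, which is what makes a shared local-model convention plausible.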
_heimdall•19m ago
I don't know enough about that API surface to know if it's a particularly good one for the use cases we'd have, but yes, defining a universal spec for all implementers to support wouldn't be a big lift, and it's done in plenty of other areas already.
alex7o•23m ago
There is no other way than shipping your own model, because you will want an abstracted API over the inference, and you don't know what the user has installed. Also, you can ship a 9b fp4 model, but it all just depends.
LPisGood•22m ago
You can know what the user has installed if the OS developer offers something.
_heimdall•12m ago
Knowing what's installed would have to be an OS API, with LLMs providing a standard API surface to the OS, likely including metadata related to feature support.
LPisGood•22m ago
> And if you have a local app, how do you take a dependency on whatever random model is installed?

Reading the tea leaves here, it will probably be common for OSes to have built-in models that can be accessed via API. Apple already does this.

vb-8448•23m ago
> Use cloud models only when they’re genuinely necessary.

The problem is that it's much easier to use the SOTA models (especially if they are subsidized) than to spend time tweaking the knobs on the local one.

I just realized this with coding agents: yeah, you probably shouldn't always use the latest version at xhigh, but you will end up doing it, because you get the job done in less time, with less "effort", and basically at the same price.

I guess we'll see a real effort for local AI only when major vendors start billing based on actual token usage.

Analemma_•17m ago
I'm also just not seeing good performance from local models. Every time a thread about LLMs comes up, there are tons of people in the comments insisting that they're getting just as good results from the latest DeepSeek/qwen/whatever as with Opus, and that just hasn't been my experience at all: open-source models just fall over completely compared to Claude when asked to do anything remotely complicated.

I have a sneaking suspicion this is kinda like the situation with Linux in the 90s, where it kinda worked but it reeeeeally wasn't ready for the home user, but you had a lot of people who would insist to your face everything was fine, mostly for ideological reasons.

holtkam2•17m ago
I wish I could upvote this twice. We (devs) really REALLY need to consider on-device compute before going to the cloud for LLM inference.
eyk19•17m ago
Apple stock is going to skyrocket
mattlondon•14m ago
Yet there is another post a few rows down where people are losing their shit because Chrome has a local LLM model that uses a couple of GB of space for local inference.

Damned if they do, damned if they don't.

dlcarrier•11m ago
Maybe don't use gigabytes of bandwidth and storage space, without asking.
aabhay•10m ago
This is a weird take. If it's not opt-in, or you're shoehorning it into a browser, then that sucks. Nobody is getting enraged that an app for running local LLMs downloads data to do so.
ekjhgkejhgk•9m ago
You don't understand the difference between "I run a local LLM because I chose to" vs "The browser chose to run a local LLM and I have no say"? You don't understand?

Not to mention that the LLM I choose to run requires a monster machine and is infinitely more capable than whatever Google chose to put in their browser?

I mean, none of this affects me because I don't use chrome, obviously, but you don't see the difference? Bewildering.

dana321•5m ago
"No AI" needs to be the norm. We should be working on better ways of sharing information and better documentation, instead of fighting with computers for substandard results.
wilg•3m ago
Two issues -

1. Local models are likely to be more power-expensive to run (per-"unit-of-intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important point.

2. Using dumb models for simple tasks seems like a good idea, but it becomes pretty clear pretty quickly that you just want the smartest model you can afford for absolutely every task.