And "good" is still questionable. The thing that makes this stuff useful is when it works instantly like magic. Once you find yourself fiddling around with subpar results at slower speeds, essentially all of the value is gone. Local models have come a long way but there is still nothing even close to Claude levels when it comes to coding. I just tried taking the latest Qwen and GLM models for a spin through OpenRouter with Cline recently and they feel roughly on par with Claude 3.0. Benchmarks are one thing, but reality is a completely different story.
Coderunner-UI: https://github.com/instavm/coderunner-ui
Coderunner: https://github.com/instavm/coderunner
As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will depreciate at that same pace, making any real investment in hardware unjustifiable.
Coupled with the dramatically inferior performance of the weights you would be running in a local environment, it's just not worth it.
I expect this will change in the future, and am excited to invest in a local inference stack when the weights become available. Until then, you're idling a relatively expensive, rapidly depreciating asset.
You're not OpenAI or Google. Just use PyTorch, OpenCV, etc. to build the small models you need.
You don't even need Docker! You can share over a simple code-based HTTP router app with pre-shared certs among friends.
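For what it's worth, the stdlib alone covers this. Here's a minimal sketch of the idea (not anyone's actual setup; the cert file names are hypothetical placeholders): a tiny HTTP endpoint that only answers clients presenting a certificate signed by a CA you share with your friends.

    import ssl
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Router(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
            # Hand `body` to whatever local model/tool you're sharing here.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain("server.pem")            # this box's cert + key (placeholder)
    ctx.load_verify_locations("friends-ca.pem")  # CA that signed your friends' certs
    ctx.verify_mode = ssl.CERT_REQUIRED          # no client cert, no service

    httpd = HTTPServer(("0.0.0.0", 8443), Router)
    httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
    httpd.serve_forever()

No Docker, no k8s: friends connect with their own pre-shared client cert and everything else is plain HTTP.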
You're recreating the patterns required to manage a massive data center in 2-3 computers in your closet. That's insane.
I never paid for cloud infrastructure out of pocket, but I still became the go-to person and landed lead-architecture roles for cloud systems, because learning the FOSS/local tooling "the hard way" put me in a better position to understand exactly what my corporate employers can leverage with the big cash they pay the CSPs.
The same is shaping up in this space. Learning the nuts and bolts of wiring systems together locally, with whatever gen-AI workloads they can support, and tinkering with parts of the process is the only thing that can actually keep me interested and able to excel on this front relative to peers who just fork out their own money to the fat cats who own billions' worth of compute.
I'll continue to support efforts to keep us on the track of engineers still understanding and being able to "own" their technology from the ground up, if only at local tinkering scale.
I feel like they actually used Docker just for the isolation part, as a sandbox (technically they didn't use Docker but something similar for Mac: Apple containers). I don't think it has anything to do with k8s, scalability, pre-shared certs, or HTTP routers :/
For other use-cases, like translations or basic queries, there's a "good enough".
And I expect that over time the gap will narrow. Sure, it's likely that commercially-built LLMs will be a step ahead of the open models, but -- just to make up numbers -- say today the commercially-built ones are 50% better. I could see that narrowing to 5% or something like that, after some number of years have passed. Maybe 5% is a reasonable trade-off for some people to make, depending on what they care about.
Also consider that OpenAI, Anthropic, et al. are all burning through VC money like nobody's business. That money isn't going to last forever. Maybe at some point Anthropic's Pro plan becomes $100/mo, and Max becomes $500-$1000/mo. Building and maintaining your own hardware, and settling for the not-quite-the-best models might be very much worth it.
And my phone uses a tiny, tiny amount of power, comparatively, to do so.
CPU extensions and other improvements will make AI a simple, tiny task. Many of the improvements will come from robotics.
We have long since entered an era where computing is becoming more expensive and power hungry; we're just lucky that regular computer usage has largely plateaued at a level where the already-attained performance is good enough.
But major leaps are a lot more costly these days.
I remember Uber and AirBnB used to seem like unbelievably good deals, for example. That stopped eventually.
And Uber is still big but about 30% of the time in places I go to, in Europe, it's just another website/app to call local taxis from (medallion and all). And I'm fairly sure locals generally just use the website/app of the local company, directly, and Uber is just a frontend for foreigners unfamiliar with that.
Seems plausible the same goes for AI.
But the open databases got good enough that you now need specific reasons to justify not using them.
That seems at least as likely an outcome for models as they continue to improve infinitely into the stars.
However, small models are continuing to improve at the same time that large RAM capacity computing hardware is becoming cheaper. These two will eventually intersect at a point where local performance is good enough and fast enough.
If cloud LLMs have 10 more IQ points than local LLMs, within a month you'll notice you're struggling to keep up with the dude who just used the cloud LLM.
LocalLlama is for hobbyists, or for people whose jobs depend on running local models.
This is not a one-time upfront setup cost vs. later payoff tradeoff. It is a tradeoff you are making on every query, and it compounds pretty quickly.
Edit: I expect nothing better than downvotes from this crowd. How HN has fallen on AI will be a case study for the ages.
Not really? The people who do local inference the most (from what I've seen) are owners of Apple Silicon and Nvidia hardware. Apple Silicon has ~7 years of decent-enough LLM support under its belt, and Nvidia is only now starting to deprecate 11-year-old GPU hardware in its drivers.
If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s. Maybe even faster inference because of MoE architectures or improvements in the backend.
I think this is the difference between people who embrace hobby LLMs and people who don’t:
The token/s output speed on affordable local hardware for large models is not great for me. I already wish the cloud hosted solutions were several times faster. Any time I go to a local model it feels like I’m writing e-mails back and forth to an LLM, not working with it.
And also, the first Apple M1 chip was released less than 5 years ago, not 7.
Do you have a good accelerator? If you're offloading to a powerful GPU it shouldn't feel like that at all. I've gotten ChatGPT speeds from a 4060 running the OSS 20B and Qwen3 30B models, both of which are competitive with OpenAI's last-gen models.
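For reference, GPU offload is roughly this much code with llama-cpp-python (a sketch, assuming a CUDA-enabled build; the GGUF filename is a placeholder):

    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3-30b-a3b-q4_k_m.gguf",  # hypothetical local file
        n_gpu_layers=-1,  # offload every layer that fits in VRAM
        n_ctx=8192,
    )
    out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
    print(out["choices"][0]["text"])

If tokens crawl out one at a time, the usual culprit is layers silently falling back to CPU.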
> the first Apple M1 chip was released less than 5 years ago
Core ML has been running on Apple-designed silicon for 8 years now, if we really want to get pedantic. But sure, actual LLM/transformer use is a more recent phenomenon.
And that’s fine! But then people come into the conversation from Claude Code and think there’s a way to run a coding assistant on Mac, saying “sure it won’t be as good as Claude Sonnet, but if it’s even half as good that’ll be fine!”
And then they realize that the heavvvvily quantized models that you can run on a mac (that isn’t a $6000 beast) can’t invoke tools properly, and try to “bridge the gap” by hallucinating tool outputs, and it becomes clear that the models that are small enough to run locally aren’t “20-50% as good as Claude Sonnet”, they’re like toddlers by comparison.
People need to be more clear about what they mean when they say they’re running models locally. If you want to build an image-captioner, fine, go ahead, grab Gemma 7b or something. If you want an assistant you can talk to that will give you advice or help you with arbitrary tasks for work, that’s not something that’s on the menu.
For inference purposes, though, compute shaders have worked fine for all 3 manufacturers. It's really only Nvidia users that benefit from the wealth of finetuning/training programs that are typically CUDA-native.
Can you explain your rationale? It seems that the worst case scenario is that your setup might not be the most performant ever, but it will still work and run models just as it always did.
This sounds like a classic and very basic opex-vs-capex tradeoff analysis, and those are renowned for showing that, in financial terms, cloud providers are preferable only in one specific corner case: a short-term investment to jump-start infrastructure when you do not know your scaling needs. This is not the case for LLMs.
OP seems to have invested around $600. This is around 3 months worth of an equivalent EC2 instance. Knowing this, can you support your rationale with numbers?
Open models are trained on modern hardware and will continue to take advantage of cutting edge numeric types, and older hardware will continue to suffer worse performance and larger memory requirements.
That's fine. The point is that yesterday's hardware is quite capable of running yesterday's models, and obviously it will also run tomorrow's models.
So the question is cost. Capex vs opex. The fact is that buying your own hardware is proven to be far more cost-effective than paying cloud providers to rent some cycles.
I brought data to the discussion: for the price tag of OP's home lab, you can only afford around 3 months' worth of an equivalent EC2 instance. What's your counter argument?
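For concreteness, the break-even arithmetic implied by those figures (the $200/month is just the $600 spread over the claimed 3 months; real EC2 pricing varies by instance type):

    capex = 600            # OP's home-lab spend, USD
    cloud_per_month = 200  # implied "equivalent EC2" cost, USD/month (600 / 3)
    print(capex / cloud_per_month)  # -> 3.0 months to break even; after that, savings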
You're right about the cost question, but I think the added dimension that people are worried about is the current pace of change.
To abuse the idiom a bit, yesterday's hardware should be able to run tomorrow's models, as you say, but it might not be able to run next month's models (acceptably or at all).
Fast-forward some number of years, as the pace slows. Then-yesterday's hardware might still be able to run next-next year's models acceptably, and someone might find that hardware to be a better, safer, longer-term investment.
I think of this similarly to how the pace of mobile phone development has changed over time. In 2010 it was somewhat reasonable to want to upgrade your smartphone every two years or so: every year the newer flagship models were actually significantly faster than the previous year, and you could tell that the new OS versions would run slower on your not-quite-new-anymore phone, and even some apps might not perform as well. But today in 2025? I expect to have my current phone for 6-7 years (as long as Google keeps releasing updates for it) before upgrading. LLM development over time may follow at least a superficially similar curve.
Regarding the equivalent EC2 instance, I'm not comparing it to the cost of a homelab, I'm comparing it to the cost of an Anthropic Pro or Max subscription. I can't justify the cost of a homelab (the capex, plus the opex of electricity, which is expensive where I live), when in a year that hardware might be showing its age, and in two years might not meet my (future) needs. And if I can't justify spending the homelab cost every two years, I certainly can't justify spending that same amount in 3 months for EC2.
I repeat: OP's home server costs as much as a few months of a cloud provider's infrastructure.
To put it another way, OP can buy brand new hardware a few times per year and still save money compared with paying a cloud provider for equivalent hardware.
> Regarding the equivalent EC2 instance, I'm not comparing it to the cost of a homelab, I'm comparing it to the cost of an Anthropic Pro or Max subscription.
OP stated quite clearly their goal was to run models locally.
However every time I run local models on my MacBook Pro with a ton of RAM, I’m reminded of the gap between local hosted models and the frontier models that I can get for $20/month or nominal price per token from different providers. The difference in speed and quality is massive.
The current local models are very impressive, but they’re still a big step behind the SaaS frontier models. I feel like the benchmark charts don’t capture this gap well, presumably because the models are trained to perform well on those benchmarks.
I already find the frontier models from OpenAI and Anthropic to be slow and frequently error prone, so dropping speed and quality even further isn’t attractive.
I agree that it’s fun as a hobby or for people who can’t or won’t take any privacy risks. For me, I’d rather wait and see what an M5 or M6 MacBook Pro with 128GB of RAM can do before I start trying to put together another dedicated purchase for LLMs.
So that’s a real brick wall for a lot of people. It doesn’t matter how smart a local model is if it can’t put that smartness to work because it can’t touch anything. The difference between manually copy/pasting code from LM Studio and having an assistant that can read and respond to errors in log files is light years. So until this situation changes, this asterisk needs to be mentioned every time someone says “You can run coding models on a MacBook!”
I have a ton of respect for SGLang as a runtime. I'm hoping something can be done there: https://github.com/sgl-project/sglang/discussions/4461 . As noted in that thread, it's really great that Qwen3-Coder has a tool parser built in; hopefully it can be some kind of useful reference/starting point. https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/b...
And there are plenty of ways to fit these models! A Mac Studio M3 Ultra with 512GB of unified memory has huge capacity and a decent chunk of bandwidth (800GB/s; compare a 5090's ~1800GB/s). $10k is a lot of money, but the ability to fit these very large models and get quality results is very impressive. Performance is lower still, but a single AMD Turin chip with its 12 channels of DDR5-6000 can get you to almost 600GB/s: a 12x 64GB (768GB) build is going to be $4000+ in RAM costs, plus, for example, $4800 for a 48-core Turin to go with it. (But if you go back a generation, affordability goes way up: it's a special part, but the 48-core 7R13 is <$1000.)
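That "almost 600GB/s" checks out from first principles, since each DDR5 channel is a 64-bit (8-byte) bus:

    channels, bytes_per_transfer, transfers_per_s = 12, 8, 6000e6
    print(channels * bytes_per_transfer * transfers_per_s / 1e9)  # -> 576.0 GB/s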
Still, those costs come to $5000+ at the low end, they come with far fewer tokens/s, and they are big investments. The "grid compute" / "utility compute" / "cloud compute" model of getting work done on a hot GPU that already has the model loaded, run by someone else, is very direct and clear. It's just not likely any of us will have anything but burst demand for GPUs, so structurally the cloud makes sense. But it really feels like only small things are getting in the way of running big models at home!
Strix Halo is kind of close. 96GB of usable memory isn't quite enough to really do the thing, though (and only 256GB/s). Even if/when they put the new 64GB DDR5 onto the platform (for 256GB, let's say 224 usable), one still has to sacrifice some quality to fit 400B+ models. Next-gen Medusa Halo is not coming for a while, but it goes from 4 to 6 channels, so 384GB total: not bad.
(It sucks that PCIe is so slow. PCIe 5.0 x16 is only 64GB/s in one direction. Compared to the need here, it's nowhere near enough to pair a big-memory host with a smaller-memory GPU.)
I don't think that's a likely future, when you consider all the big players doing enormous infrastructure projects and the money that this increasingly demands. Powerful LLMs are simply not a great open-source candidate. The models are not a by-product of the bigger thing you do; they are the bigger thing. Open sourcing an LLM means you are essentially investing money just to give it away. That simply does not make a lot of sense from a business perspective. You can do that in a limited fashion for a limited time, for example when you are scaling, or when it's not really your core business and you just write it off as expenses while you try to figure yet another thing out (looking at you, Meta).
But with the current paradigm, one thing seems to be very clear: building and running ever bigger LLMs is a money-burning machine the likes of which we have rarely if ever seen, and operating that machine at a loss will make you run out of any amount of money really, really fast.
I'm really hoping for that too. As I've started to adopt Claude Code more and more into my workflow, I don't want to depend on a company for day-to-day coding tasks. I don't want to have to worry about rate limits or API spend, or having to put up $100-$200/mo for this. I don't want everything I do to be potentially monitored or mined by the AI company I use.
To me, this is very similar to why all of the smart-home stuff I've purchased all must have local control, and why I run my own smart-home software, and self-host the bits that let me access it from outside my home. I don't want any of this or that tied to some company that could disappear tomorrow, jack up their pricing, or sell my data to third parties. Or even use my data for their own purposes.
But yeah, I can't see myself trying to set any LLMs up for my own use right now, either on hardware I own, or in a VPS I manage myself. The cost is very high (I'm only paying Anthropic $20/mo right now, and I'm very happy with what I get for that price), and it's just too fiddly and requires too much knowledge to set up and maintain, knowledge that I'm not all that interested in acquiring right now. Some people enjoy doing that, but that's not me. And the current open models and tooling around them just don't seem to be in the same class as what you can get from Anthropic et al.
But yes, I hope and expect this will change!
Unless you're a billionaire with pull, you're building on tools you can't control and can't own: ephemeral wisps.
And that's if you can even trust these large models to be consistent.
Also, the term “remote code execution” in the beginning is misused. Ironically, remote code execution refers to execution of code locally - by a remote attacker. Claude Code does in fact have that, but I’m not sure if that’s what they’re referring to.
If you put a remote LLM in the chain, then it is 100% going to inadvertently send user data up to them at some point.
e.g. if I attach a PDF to my context that contains private data, it WILL be sent to the LLM. I have no idea what "operating blind" means in this context. Connecting to a remote LLM means your outgoing requests are tied to a specific authenticated API key.
Blazing fast, cross-platform, and supports nearly all recent OSS models.
I'm working on something similar, focused on being able to easily jump between the two (cloud and fully local) using a Bring Your Own [API] Key model: all data/config/settings/prompts are stored fully locally, and provider API calls are routed directly (they never pass through our servers). Currently using mlc-llm for fully local in-browser models & inference (Qwen3-1.7b has been working great).
But I still hope that we can someday actually get some meaningful improvements in speed too. Diffusion-based architectures seem to be really fast.
Incidentally, I decided to try the Ollama macOS app yesterday, and the first thing it tries to do upon launch is connect to some Google domain. Not very private.
I configure them both to use local ollama, block their outbound connections via little snitch, and they just flat out don’t work without the ability to phone home or posthog.
Super disappointing that Cline tries to do so much outbound comms, even after turning off telemetry in the settings.
Supports MLX on Apple silicon. Electron app.
There is a CI to build downloadable binaries. Looking to make a v0.1 release.
A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.
(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.
Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."
Code: https://github.com/yichuan-w/LEANN
Paper: https://arxiv.org/abs/2405.08051
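To make the core idea concrete, here's a toy, heavily simplified illustration of the recompute-instead-of-store approach (this is not LEANN's actual code; embed() is a fake stand-in for a real local embedding model, and the neighbor graph here is arbitrary):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Fake stand-in for a real embedding model (hash-seeded so the
        # same text maps to the same vector within a run).
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(384)
        return v / np.linalg.norm(v)

    # Store only raw text plus a small neighbor graph: no vectors on disk.
    docs = {i: f"email body {i}" for i in range(1000)}
    graph = {i: [(i + k) % 1000 for k in (1, 7, 42)] for i in docs}

    def search(query: str, entry: int = 0, max_hops: int = 50) -> int:
        # Greedy graph walk: re-embed only the nodes actually visited.
        q = embed(query)
        best, best_sim = entry, float(q @ embed(docs[entry]))
        for _ in range(max_hops):
            improved = False
            for n in graph[best]:
                sim = float(q @ embed(docs[n]))  # recomputed, never stored
                if sim > best_sim:
                    best, best_sim, improved = n, sim, True
            if not improved:
                break
        return best

    print(docs[search("travel receipts from last spring")])

The storage win comes from visiting only a tiny fraction of nodes per query, so you pay a little compute at search time instead of hundreds of gigabytes at rest.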
Are there projects that implement this same “pruned graph” approach for cloud embeddings?
In 2025 I would consider this a relatively meager requirement.
However, the 50GB figure was just a starting point, for emails alone. A true "local Jarvis" would need to index everything: all your code repositories, documents, notes, and chat histories. That raw data can easily be hundreds of gigabytes.
For a 200GB text corpus, a traditional vector index can swell to >500GB. At that point, it's no longer a "meager" requirement. It becomes a heavy "tax" on your primary drive, which is often non-upgradable on modern laptops.
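That >500GB figure is easy to reproduce from first principles (the chunk size and embedding width below are assumptions, just to show the order of magnitude):

    corpus_bytes = 200e9   # 200GB of raw text
    chunk_bytes = 1024     # ~1KB per chunk (assumption)
    dims = 768             # embedding width (assumption)
    n_chunks = corpus_bytes / chunk_bytes  # ~195M chunks
    index_bytes = n_chunks * dims * 4      # float32 vectors alone
    print(index_bytes / 1e9)  # -> ~600 GB, before graph/metadata overhead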
The goal for practical local AI shouldn't just be that it's possible, but that it's also lightweight and sustainable. That's the problem we focused on: making a comprehensive local knowledge base feasible without forcing users to dedicate half their SSD to a single index.
This shows how little native app training data is even available.
People rarely write blog posts about designing native apps, long-winded Medium tutorials don't exist, and heck, even the number of open-source projects for native desktop apps is a small percentage compared to mobile and web apps.
Historically, Microsoft paid some of the best technical writers in the world to write amazing books on how to code for Windows (see: Charles Petzold), but nowadays that entire industry is almost dead.
These types of holes in training data are going to be a larger and larger problem.
Although this is just representative of software engineering in general - few people want to write native desktop apps because it is a career dead end. Back in the 90s knowing how to write Windows desktop apps was great, it was pretty much a promised middle class lifestyle with a pretty large barrier to entry (C/C++ programming was hard, the Windows APIs were not easy to learn, even though MS dumped tons of money into training programs), but things have changed a lot. Outside of the OS vendors themselves (Microsoft, Apple) and a few legacy app teams (Adobe, Autodesk, etc), very few jobs exist for writing desktop apps.
It's just more freedom and privacy in that regard.