Running local models is good now

https://vickiboykis.com/2026/06/15/running-local-models-is-good-now/

151•jfb•1h ago

Comments

_doctor_love•40m ago

"Just get a 64GB Mac with 1TB of storage!"

LOL - some of us have a budget

tjwebbnorfolk•34m ago

AI and budgets don't mix well at the moment

techscruggs•33m ago

He is using a 2022 M2, which you can get that for about $2k used. That is beyond reasonable.

psychoslave•20m ago

Global Affordability Estimate:

Top 10% of global earners (~800M people) can afford a $2,000 device without major financial strain.

Top 25% (~2B people) could afford it with some budget adjustments.

Bottom 50% (~4B people) would find it prohibitively expensive.

So for a SV top income, maybe that might look more like the weekly pet brushing budget, but for most people out there this is not that much of a no-brainer.

richwater•13m ago

Yes, because the bottom 50%, mostly impoverished or near impoverished folks were spending money on Claude Code subscriptions instead /s

disgruntledphd2•7m ago

The maths changes if you're working for yourself. Because I live in Europe, I've ended up working as a contractor due to the lack of a legal entity in my country. While that mostly sucked for a bunch of reasons, I was able to get a 64Gb Mac M2 a few years back with approximately a 52% discount, which was kinda nice.

Shekelphile•14m ago

She

themythfable•33m ago

Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.

Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.

embedding-shape•30m ago

> Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

There are segments, everything from "Average person in world" to "Average creative professional using computers for work", with a wide range of costs for the hardware. HN probably skews towards the latter rather than the former, probably sitting with enterprise hardware next to them basically for fun, hard to make wider conclusions from what people here have or not.

sublinear•10m ago

If we define "typical" as the median HN budget, it's probably about the same as yours. Maybe the answer would have been different 10 or 20 years ago, but the era of truly needing a big budget PC has been over for a while.

It's just for gaming and AI now. Maybe not even gaming as much anymore.

Consider the perspective of someone who has a practically unlimited budget for PCs, doesn't game much anymore, and doesn't need AI to do their job. It's just part of getting older, and there are plenty of people in their late 30s and older on here.

p-e-w•31m ago

No need. You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water.

swatcoder•30m ago

Sure, but it's also not really out of scale with the cost of a shop tool in other trades.

If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.

That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.

amalcon•28m ago

A Strix Halo with similar RAM is considerably cheaper. Still not cheap, mind, but performance is OK (not great) and it will run more or less the same models.

AbsurdCensor•17m ago

At least for me, it's been pretty great, but I bought my system when it was $1800, now looks like the same system is $2700 and out of stock. I still haven't quite been able to run 120B parameter models under Windows, but for Qwen Coder 30B, it works pretty darn well for my at home needs.

amalcon•13m ago

Yeah, they have gone up a lot since I bought mine too. I did get Qwen3.5-122b running on all-GPU (on a 128GB machine) under a minimal Arch Linux setup (I do my GUI work on a much cheaper box). It worked, but Qwen3.6-35b is performing almost as well and a lot faster.

Still cheaper than a new Mac. Maybe not cheaper than a used one.

anarticle•15m ago

Pros buy their own tools. This is why working for yourself is better than working for a corpo, you get to choose your weapon.

anax32•36m ago

I've just made a milestone on my project, moving away from AWS (budget) to self-hosted and the local models are so much faster than in the past. Beyond LLMs, having embeddings, image, video, audio gen available is crazy.

Running locally is the bar; it's hard to make these things a service which scales.

richbradshaw•36m ago

I’m keen to understand speed here etc etc. if I bought a Mac studio with 96GB - what can I realistically run, how’s it compare to fable/opus etc and how fast is it?

Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!

simonw•16m ago

I strongly recommend trying LM Studio - it's the lowest friction way to try out models, you can browse https://lmstudio.ai/models and click "Get" and then "Run in LM Studio" to download and run a model.

With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.

AbsurdCensor•8m ago

I think currently you can only get the M3 Ultra Studio with 96gb, and for coding tasks, say you rub Qwen Coder on it (which doesn't need that much ram), it's not the fastest, something like 30-40 tok/sec. Probably better with a MacBook Pro with the M5 chip. There is a website for comparing different configurations and models: https://llmcheck.net/benchmarks

rmunn•34m ago

This is the kind of thing that Anthropic et al should be worried about. As it becomes easier and easier to run local models, the ceiling of what they'll be able to charge will get lower and lower. Not that nobody will be willing to pay $$$$$ per month, but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.

themaninthedark•32m ago

Maybe that is why they are buying up as much hardware as they can? If their service is the only game in town.

otterdude•30m ago

Data Center providers are buying hardware, not anthropic. Certainly related but alot of the hardware purchased is just sitting in a warehouse waiting for a data center to get built.

indoordin0saur•22m ago

I'm curious when coding-heavy companies will start running their own on-prem AI clusters. Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it? I imagine this won't appeal to everybody but with the trust issues the hyperscalers have developed hoovering up people's data and using it to train their models, I imagine some will find value in a machine and model they have transparent control over including the option to walk over and unplug the thing.

sathackr

embedding-shape•33m ago

Show us the resulting code of using them! :) I want to use local models, I have the hardware for it, but while trying them out as replacements for GPT 5.5 xhigh or Opus or other SOTA models, they aren't quite ready to be replaced yet, sadly. The quality and bumps they encounter just slows down the workflow so much, even screwing up tool call syntax sometimes.

But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.

Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)

zozbot234•11m ago

> Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Diffusion models can't really be trained beyond low-to-mid size and have lower quality over an equally sized, plain one-token-at-a-time model.

cube00•32m ago

The challenge I have is getting a large enough context window so tool calls work reliably, the local models easily slip into hallucinated JSON tool responses and won't trigger the tools as a result.

hypfer•31m ago

After having been a happy user of Qwen3.6-27B for a few weeks, due to being away from the hardware, I'm currently forced to use Claude Sonnet 4.6

It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.

Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.

I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.

Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.

Anyway, point is: full ack on that headline.

kitd•25m ago

Funny that coding agents have personalities, including "that colleague" you want to avoid even if you know they're probably quite good at what they do!

MostlyStable•24m ago

Curious if you have tried custom instructions. I was never quite as unhappy with Claude's voice as you appear to be, but there were several things I didn't like. A custom prompt fixed almost all of them.

clickety_clack•19m ago

I think it would be very hard to convince someone to pay $100/mo to go back to Claude if they have a local model up and running, particularly now that model improvement has basically been stalled for the last 6 months. It’s so easy to set it up for yourself now too with things like LM studio. That said, there will always be unsophisticated users who can’t figure it out, so there will always be someone there to pay.

wxw•29m ago

> “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.

To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.

cautiouscat•24m ago

> I have no concrete scientific evidence of this - my own personal vibe metric of “is a model good enough” is, “do I have to double-check it against an API model”, and GPT-OSS was the first one where I started doing that a lot less often.

The good old butt dyno!

I’ve been eyeing local models more and more with Anthropic squeezing more and more on the subscriptions. A few comments on HN had me waiting until they improved more but this article makes me wonder if I should reconsider that.

I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.

xienze•21m ago

The big caveat here is that these local models require you to invest some time tweaking your harness, AGENTS.md, and skills in order to get things roughly to the level you'd expect. But something like Qwen3.6-27B with web search capabilities and a good set of skills really is impressive! Especially considering that you can go wild and not worry about token costs.

The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes it's a lot of $$$ upfront but it's very much unknown when hardware prices are going to come back to reality. There's a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumptions that a Claude subscription over three years time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.

sosodev•19m ago

I think this is overselling their capabilities. I've used Gemma 4 and Qwen 3.6 quite a bit on my strix halo home server. They're great models and the dense variants are significantly better, but they're still very far behind the frontier. If you boot up Gemma 4 MoE and OpenCode/Pi and expect to perform anything like Claude Code or Codex you're going to be very disappointed.

chrismarlow9•18m ago

You can use a frontier model to create a plan that's specific enough for a local model of a very small size to execute on. The more specific you are and compartmentalize tasks the "dumber" the local model can be.

Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee

simonw•18m ago

I think gemma-4-26b-a4b and Qwen3.6-35B-A3B show that there's something very interesting about a local model that does mixture-of-experts (which helps a lot with performance) and has in the order of 30 billion parameters.

These models are very capable, and use around 20-30GB of RAM while they are running.

Provided you have 64GB of RAM that leaves space for running other applications at the same time.

chrisweekly•6m ago

Obtaining that 64GB RAM is a meaningful obstacle for many.

stared•17m ago

I really recommend Qwen3.6 27B.

Make some tests, and its 8 bit version runs at 30tok/s when using llama.cpp with MTP and run on Macbook Max M5. I have 128 GB, but but 64 GB is well enough. https://github.com/stared/benching-local-llms-on-apple-silic...

When using benchmarks, it gives more-or-less the level of SotA mid-late 2026.

wizzledonker•8m ago

Did you mean 2025?

ibizaman•16m ago

Tangential but reading on mobile, the font size in the code snippets are all over the place. I actually have the same issue on my blog. Anyone knows why?

aliljet•12m ago

The problem here is always the cost-benefit. For $200/mo, you're receiving subsidized best of breed access. There's no model competing for that price anywhere. If a 27B param model is what you choose, show me your hardware! I would love to be wrong...

0xc0c0c0•11m ago

I have used local models (around 128 gb) and the big proprietary models, and while I do want local models to win, it's important we keep the expectations of local models realistic. There are many blog posts about how local models today can fully replace some of the proprietary models and in some cases its true for the much smaller proprietary models, its very clearly much more behind the larger models.

You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.

One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.

wasimxyz•9m ago

https://canirun.ai

anubhav200•9m ago

I have been using qwen and glm based models from last 2 years, ended up buying mutiple machines for the same. Overall i feel 24vram is a must have to get get performance (speed wise) to match hosted soln. I have 2 machines a 12gb vram one and a 24gb one. On 12gb vram i get around 50tps generation and 500tps prompt processing and on 24gb one i get 180tps generation and 3500tps prompt processing. I have different configs for different scenarios and I also use llama cpp manager manage all my configs (https://github.com/anubhavgupta/llama-cpp-manager)

Why AI Will Accelerate Health Care Inflation

Connecting to a Lot of People on LinkedIn via Browser DevTools

'David Bowie was a crazy workaholic': Labyrinth at 40 – an oral history

Anthropogenic Geomaterials – What Is This Rock?

Show HN: INT21 – Self-Improving PTX Kernel Factory

Copper drug restores memory and clears toxic Alzheimer's proteins

Chainguard's new Athena coalition uses AI to fix open-source flaws

Publishers to bill AI firms for unwanted scraping or take them to court

Google Earth Flight Simulator Is Now Available

Show HN: NIS2 Readiness Check

Trump officials won't allow G7 countries to access Anthropic's advanced models

Building a Soviet Nail Factory: how KPIs killed efficiency

Cell-Based Architecture for Resilient Payment Systems

The agent is a file. Define it once, call it from anywhere

Dear Researchers: The Invisible Work

SubQ 1.1 Card: Linear-scaling sparse attention with 98% retrieval at 12M tokens [pdf]

PointlessQuest puts a full MMO on Playdate's 2.7-inch screen

Pipkin's Light Bulb Moment

SpaceX Purchases Cursor, a Claude Code and OpenAI Codex Competitor

The cost of saying no, then doing it anyway

Mental causation is not load-bearing

Show HN: Agent Harness Lab – compare agent frameworks with swappable tools

The price of liberty is eternal vigilance

Calvin and Hobbes and the Price of Integrity

A Quick Intro to Nial

MIT Grad Solved 2k Coding Problems was Rejected—Is the Interview System Broken?

Therapy for Billionaires

Hyperglycosylation is a metabolic driver of Alzheimer's disease

Why Can't They Just

A look into Ubuntu Core 26: Building a local AI inference appliance