_345•1h ago
It's a seriously degraded experience from a developer's perspective. Okay, you've finally got one local LLM installed after configuring everything perfectly; what happens when you want to run a second instance? Now you've blown past your VRAM and system RAM limits, and you're stuck with just one.
Furthermore, the model they recommend doesn't quite reach ~gpt-5.4-mini level performance. That quality dip means you may as well just pay for something like Kimi K2.6 via OpenRouter if you want something roughly at or above Sonnet 4.6 in performance as a backup for when you run out of Anthropic/OpenAI usage.
2ndorderthought•55m ago
Why are you running two instances anyway? If you want that workflow, just rent a few EC2 GPU instances and fire away.
vidarh•44m ago
If you're going to rent a few EC2 GPU instances, you might as well funnel things through OpenRouter. Not that many of us have workflows where trusting an LLM provider is a problem but sending the data to EC2 is not.
As for why, why would you not? Sitting around waiting for a single assistant is an inefficient use of time; I tend to have more like 4-10 instances running in parallel.
jen20•35m ago
> Not that many of us have workflows where trusting an LLM provider is a problem but sending the data to EC2 is not.
I'd imagine plenty of people have a problem with trusting fly-by-night inference providers, or model owners with opt-out policies [1] [2] about training on your data, who would be more than happy to send data to EC2, or even run the same models in Amazon Bedrock.
[1]: https://github.blog/news-insights/company-news/updates-to-gi...
[2]: https://help.openai.com/en/articles/5722486-how-your-data-is...
2ndorderthought•2m ago
I also do not run 10 agents at the same time. There's no way I could keep up with the volume of work from doing that in any meaningful way.
0xbadcafebee•53m ago
Local AI only makes sense for a couple of use cases. It's "cheaper" when you already have the hardware sitting around, like an old MacBook or a gaming GPU, or when the API cost is too high to bear (subscriptions will all run out if you churn 24/7). I'm surprised companies are still selling their old MacBooks to employees when they could be turning them into Beowulf clusters for cheap AI compute on long-running jobs; the cost is just electricity.
If usage-based pricing is killing your vibe, find a cheaper subscription with higher limits. Here's a list of them compared on price-per-request-limit: https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...
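A minimal sketch of what a price-per-request-limit comparison boils down to; the plan names, prices, and limits below are made-up placeholders, not real pricing:

    # Made-up plans: (name, price per month in $, request limit per month).
    # Lower price-per-request is better.
    plans = [
        ("plan_a", 20.0, 2_000),
        ("plan_b", 100.0, 15_000),
        ("plan_c", 200.0, 40_000),
    ]

    for name, price, limit in plans:
        print(f"{name}: ${price / limit:.4f} per request")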
xscott•32m ago
I think you're right about the cost/benefit trade-off in general, but I do wonder how much of the "compaction" Codex and Claude do is to keep context fresh and how much is to save them runtime costs.
If you've got a 1M token context but they constantly summarize it down to something much smaller, is it really 1M tokens of benefit? With a local model, you can use all 256k tokens on your own terms. However, I don't have any benchmarks to know.
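A minimal sketch of the kind of compaction being described, with summarize() as a placeholder for a real model call and word counts standing in for a real tokenizer:

    TOKEN_BUDGET = 256_000   # the window a local model might give you
    KEEP_RECENT = 20         # recent turns kept verbatim

    def count_tokens(messages):
        # Crude stand-in for a real tokenizer.
        return sum(len(m.split()) for m in messages)

    def summarize(messages):
        # Placeholder: a hosted assistant would call the model here,
        # replacing old turns with a shorter, lossy summary.
        return f"[summary of {len(messages)} earlier messages]"

    def compact(messages):
        if count_tokens(messages) <= TOKEN_BUDGET:
            return messages                   # still fits; nothing lost
        old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
        return [summarize(old)] + recent      # earlier detail is now lossy

Once a step like that kicks in, the advertised 1M window is no longer what the model actually sees.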
xscott•38m ago
Your point about caliber/quality is fair, but I have been pretty astonished by some of the newer/better models (Gemma 4 variants, GPT-OSS before that).
However, there isn't much of a memory increase from having multiple sessions in parallel with one model. It's an HTTP server, and other than some caching, it's basically stateless.
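A sketch of what that looks like from the client side: two concurrent sessions against one local server. It assumes an OpenAI-compatible endpoint at localhost:8080 (llama.cpp's llama-server and vLLM both expose one); the URL and model name are placeholders for your setup:

    import json
    import threading
    import urllib.request

    URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint

    def ask(session, prompt):
        body = json.dumps({
            "model": "local-model",  # placeholder; some servers ignore this
            "messages": [{"role": "user", "content": prompt}],
        }).encode()
        req = urllib.request.Request(
            URL, data=body, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            reply = json.load(resp)["choices"][0]["message"]["content"]
        print(f"[{session}] {reply[:80]}")

    # The weights load once in the server process; each extra session mostly
    # costs KV cache, not another copy of the model.
    threads = [
        threading.Thread(target=ask, args=("session-1", "Summarize RFC 2616.")),
        threading.Thread(target=ask, args=("session-2", "Write a haiku about VRAM.")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()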
janice1999•58m ago
A 24 GB Nvidia RTX 3090 Ti is ~€2,000.
2ndorderthought•56m ago
Which is how many months of Claude, or Claude + ChatGPT for when Claude is down? And do you own anything after using those subscriptions? Can you pick and choose from dozens of models and whatever comes next? Can you play video games with your Claude subscription?
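Back-of-envelope, treating the monthly subscription price as an assumption to swap for your own plan's number:

    gpu_cost_eur = 2000.0          # the used 3090 Ti from above
    subscription_eur_month = 90.0  # assumed: a ~$100/month-class plan

    months = gpu_cost_eur / subscription_eur_month
    print(f"break-even after ~{months:.0f} months")  # ~22 months
    # ...ignoring electricity, and ignoring that the GPU keeps resale value
    # (and plays games) while a lapsed subscription leaves you with nothing.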
beej71•37m ago
Believe me when I say that I want to run local models, and I do. But in my testing, 24 GB doesn't get you much brainpower.
efficax•21m ago
Qwen3.6 does a good job locally, except it can take 20-30 minutes to respond to a prompt on a Mac Studio with 32 GB of RAM.