For most of my questions, an 8-9B model works great. The upshot is not having ChatGPT/Meta sell my data or target me with random thoughts later.
Fortunately, I don't have to pick one or the other - instead I run Qwen 3.6 35B A3B. It's a bit slow with my 8GB GPU (I'm in the process of getting a bigger one), but again, to me the choice isn't "what's the best I can get", it's "what's the best local I can get".
I also wonder what quantization you're using? If you haven't tried other quants, I really would.
You can get work done with them if you have a harness that can drive outcomes without needing feedback (I've been building a TDD red-to-green agent harness lately that is very effective when given a good plan upfront). So if you can stand waiting a few days for results that would take only hours with a model deployed on frontier Nvidia hardware, you can get results this way.
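For the curious, here's a minimal sketch of what such a red-to-green loop can look like. It is not my actual harness: the endpoint, model name, and prompt are made up, and it assumes an OpenAI-compatible local server plus a pytest suite as the only oracle.

```python
# Minimal red-to-green loop sketch. Endpoint, model name, and prompt
# are assumptions, not a real harness.
import subprocess
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server

def propose_patch(failures: str) -> str:
    # Ask the local model for a unified diff addressing the failing tests.
    resp = requests.post(ENDPOINT, json={
        "model": "local",
        "messages": [{"role": "user",
                      "content": f"These tests fail:\n{failures}\n"
                                 "Reply with only a unified diff that fixes them."}],
    })
    return resp.json()["choices"][0]["message"]["content"]

def red_to_green(max_rounds: int = 20) -> bool:
    for _ in range(max_rounds):
        run = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if run.returncode == 0:
            return True                      # green: done
        patch = propose_patch(run.stdout)
        subprocess.run(["git", "apply", "-"], input=patch, text=True)
    return False                             # still red after the budget
```

Because the test suite is the only feedback signal, slow local tokens just translate into wall-clock time, not extra supervision.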
_345•1h ago
Furthermore, the model they recommend doesn't quite reach ~gpt-5.4-mini level performance. That quality dip means you may as well just pay for something like Kimi K2.6 via OpenRouter if you want something ~>= Sonnet 4.6 in performance as a backup for when you run out of Anthropic/OpenAI usage.
2ndorderthought•1h ago
vidarh•1h ago
As for why: why would you not? Sitting around waiting for a single assistant is an inefficient use of time; I tend to have more like 4-10 instances running in parallel.
jen20•1h ago
I'd imagine plenty of people have a problem trusting fly-by-night inference providers, or model owners with opt-out policies [1] [2] on training with your data, but who would be more than happy to send data to EC2, or even run the same models on Amazon Bedrock.
[1]: https://github.blog/news-insights/company-news/updates-to-gi...
[2]: https://help.openai.com/en/articles/5722486-how-your-data-is...
2ndorderthought•33m ago
I also do not run 10 agents at the same time. There's no way I could keep up with the volume of work that would produce in any meaningful way.
killingtime74•18m ago
0xbadcafebee•1h ago
Local AI only makes sense for a couple of use cases:
Local AI is "cheaper" when you already have the hardware sitting around, like an old MacBook or gaming GPU, or the API cost (subscriptions will all run out if you churn 24/7) is too high to bare. I'm surprised companies are still selling their old MacBooks to employees, when they could be turning them into Beowulf clusters for cheap AI compute on long-running jobs (the cost is just electricity)If usage-based pricing is killing your vibe, find a cheaper subscription with higher limits. Here's a list of them compared on price-per-request-limit: https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...
xscott•1h ago
If you've got a 1M token context, but they constantly summarize it down to something much smaller, is it really 1M tokens of benefit? With a local model, you can use all 256k tokens on your own terms. However, I don't have any benchmarks to know for sure.
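On the local side, here's one way to see how much of the window a transcript actually occupies, with no summarization step in between. This assumes llama-cpp-python; the file paths and the 256k figure are illustrative:

```python
# Illustrative sketch: measure raw context usage of a transcript locally.
# Assumes llama-cpp-python is installed and model.gguf supports 256k context.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=262144)   # ask for the full 256k
tokens = llm.tokenize(open("transcript.txt", "rb").read())
print(f"{len(tokens)} of {llm.n_ctx()} tokens used, nothing summarized away")
```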
xscott•1h ago
However, there isn't much of a memory increase from running multiple sessions in parallel against one model. It's an HTTP server, and other than some caching, it's basically stateless.
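To make the "one model, many sessions" point concrete, here's a sketch that fires several independent chats at a single local OpenAI-compatible server; the endpoint and model name are assumptions:

```python
# Sketch: several independent sessions against one local model server.
# Endpoint and model name are assumptions (llama.cpp-style server).
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"

def chat(prompt: str) -> str:
    # Each request carries its own full history, so the server holds no
    # per-session state beyond optional prompt caching.
    resp = requests.post(ENDPOINT, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    })
    return resp.json()["choices"][0]["message"]["content"]

prompts = ["Summarize foo.py", "Draft a commit message", "Explain this trace"]
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(chat, prompts):
        print(answer[:80])
```

The weights are loaded once; each extra session mostly costs its own KV cache.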
iib•7m ago