I think the thing is, there's an unspoken "for now" at the end of that sentence, and people running this locally are hedging against that "for now". Some people prefer to feel that they own the means rather than rent them, even if what they own is worse than what they can rent. That holds especially after today's Fable news and the harsh realisation that the "for now" depends on many unpredictable factors, whereas the setup you have locally costs you capital today plus a relatively predictable run-rate (made more predictable with on-prem solar, for example), but should otherwise work predictably forever.
I'm not saying that you're wrong to do what you're doing, just that many people have their own lines in the sand where renting vs buying makes sense, and it doesn't only boil down to a rational (or irrational) financial decision.
If suddenly the CCP declared a total digital embargo on Alibaba's Qwen models or even if for some reason all of mainland China (and Singapore) was completely unreachable from the rest of the world, the dozen or so companies selling Qwen by the token elsewhere in the world could continue business as usual.
I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.
Both fail at different tasks, and Qwen more so than Claude.
But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.
In coding tasks that Qwen can't solve it often just goes into a tool calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things just making more and more mess that takes forever to clean up.
I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had a similar experience?
The moment I'm trying something open-ended or ambitious, Claude/ChatGPT clearly take you to the goal quicker.
For things where there's a way to build a knowledge base, though, the local LLM can definitely be a true contender. Plus, with a big context and no worries about filling it over and over, you can get quite far.
I'm writing this literally in between cooking a pasta dish for which the local LLM ordered the ingredients online. I've built a grocery shopping skill, so it roughly knows what I have in the fridge (loosely), my last 10 representative orders (general preferences plus rich info about shops and SKUs around me) and actual real-time in-stock info. The last part has been my personal pet peeve with every product that promised cooking ingredient delivery (that is not packaged specifically for that).
This is what has been promised to us by every big tech company with an agent, and now a local LLM has actually solved it for me fully.
Would like to see the perf of their setup with and without mtp and ngram speculative decoding though, as well as parallel decode performance (once llamacpp mtp plays well with multiple slots).
Being in California, electricity alone makes this non-competitive with just paying for a cloud, though.
It's surprising how little these things come up given the price they go for
The entire stack (minus some binary blobs in firmware) is open source, so if you have the time and persistence you can get whatever you want done.
A few community members have been working on support with llamacpp, where we can have supported operations offloaded to the TT cards, while having unsupported ops running on GPU or CPU. Llamacpp is pretty good at that. The existing kernels could definitely be better, and I’ll try my hand at writing some kernels some time.
Very interesting though, these Tenstorrent chips. Might get one to experiment with.
NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb
NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb
On Apple Silicon, with MLX-LM, I am getting 20 tok/s on a Macbook Max M5. Not sure how it compares to llama.cpp performance.
In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on a laptop is wild. It does heat my laptop rapidly, though.
If you're not power limiting in nvidia-smi, start.
        GPU0    GPU1
GPU0    X       CNS
GPU1    CNS     X
i guess not, i use llama.cpp with:
--spec-draft-n-max 3 --spec-type draft-mtp --split-mode tensor --tensor-split 1,1
and my (gen) tk/s are between 60-80 tk/s
will test this uncensored model and ngram added as well this weekend
btw, i also set my power limit to 220 W per card (with nvidia-smi); that will cost you around 1 tk/s but save you a LOT of power and heat :)
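For anyone who wants to script the power limit mentioned above, here is a minimal sketch. It only builds the `nvidia-smi -i <id> -pl <watts>` command lines and prints them; it does not run anything. The two GPU indices (0 and 1) are an assumption based on the two-card setup in this thread, and applying the limit normally needs root.

```python
import shlex


def power_limit_commands(gpu_ids, watts):
    """Build one `nvidia-smi -i <id> -pl <watts>` command per GPU index."""
    return [["nvidia-smi", "-i", str(gpu), "-pl", str(watts)] for gpu in gpu_ids]


# Assumption: two cards with indices 0 and 1, limited to 220 W each.
commands = power_limit_commands([0, 1], 220)

for cmd in commands:
    # Printed for review only; run them yourself (typically with sudo).
    print(shlex.join(cmd))
```

The limit usually does not survive a reboot, so people tend to put these lines into a startup script.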
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-EX...
https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-Mi...
Do be sure to use dflash and/or mtp for the draft:
The options listed are none of these.
Also, the recommended Qwen MTP settings are `--spec-type draft-mtp --spec-draft-n-max 2`. 3 is not good on Nvidia hardware under different workloads. You can also add `ngram-mod`, but after `draft-mtp`; however, default `ngram-mod` settings aren't well tuned, and you want `--spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 16 --spec-ngram-mod-n-match 6` (defaults are 48, 64, 24; the ratio is good, the magnitude is suboptimal).
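To make the "ratio is good, magnitude is suboptimal" point concrete, here is a small sketch that checks the suggested values are the defaults scaled by one common factor and then prints the flags. The numbers and flag names are the ones quoted in the comment above; the dictionary and function names are just illustrative.

```python
from fractions import Fraction

# Values quoted in the comment above.
DEFAULTS = {"n-min": 48, "n-max": 64, "n-match": 24}
SUGGESTED = {"n-min": 12, "n-max": 16, "n-match": 6}


def scale_factors(defaults, suggested):
    """Return the set of suggested/default ratios; one element means same proportions."""
    return {Fraction(suggested[key], defaults[key]) for key in defaults}


def ngram_mod_flags(settings):
    """Render the settings as the --spec-ngram-mod-* flags named in the comment."""
    return " ".join(f"--spec-ngram-mod-{key} {value}" for key, value in settings.items())


factors = scale_factors(DEFAULTS, SUGGESTED)
flags = ngram_mod_flags(SUGGESTED)
print(factors)  # a single factor means only the magnitude changed
print(flags)
```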
Of abliterated Qwen 3.6 27B models, huihui's ends up being the worst. Try heretic instead. https://huggingface.co/mradermacher/Qwen3.6-27B-uncensored-h...
It looks like there's a hardcoded preference, CLI order is not important.
(speculative.cpp:1322-1381): common_get_enabled_speculative_configs converts the types vector to a bitmask (order-independent). Then configs are added in a hardcoded priority order:
ngram-simple
ngram-map-k
ngram-map-k4v
ngram-mod
ngram-cache
draft-simple
draft-eagle3
draft-mtp
(speculative.cpp:1557-1603): common_speculative_draft iterates impls in the hardcoded priority order. Once an impl produces a draft for a sequence, later impls skip that sequence.
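The behaviour described above can be sketched in a few lines of Python. This is an illustration of the described logic, not a port of llama.cpp: the priority list is copied from the comment, while `to_bitmask`, `enabled_in_priority_order`, `draft` and the toy drafters are hypothetical names made up for the example.

```python
# Priority order as listed in the comment above.
PRIORITY = [
    "ngram-simple", "ngram-map-k", "ngram-map-k4v", "ngram-mod",
    "ngram-cache", "draft-simple", "draft-eagle3", "draft-mtp",
]


def to_bitmask(types):
    """Collapse the requested types into a bitmask, which discards CLI order."""
    mask = 0
    for name in types:
        mask |= 1 << PRIORITY.index(name)
    return mask


def enabled_in_priority_order(types):
    """Re-expand the bitmask in the fixed priority order."""
    mask = to_bitmask(types)
    return [name for bit, name in enumerate(PRIORITY) if mask & (1 << bit)]


def draft(types, drafters, seq_ids):
    """First impl (in priority order) to produce a draft wins for each sequence."""
    result = {}
    for name in enabled_in_priority_order(types):
        for seq in seq_ids:
            if seq in result:
                continue  # an earlier impl already drafted this sequence
            tokens = drafters[name](seq)
            if tokens:
                result[seq] = (name, tokens)
    return result


# Hypothetical drafters: ngram-mod only finds a match for sequence 1.
toy_drafters = {
    "ngram-mod": lambda seq: [1, 2] if seq == 1 else [],
    "draft-mtp": lambda seq: [9],
}
print(draft(["draft-mtp", "ngram-mod"], toy_drafters, [0, 1]))
```

Under this reading, `ngram-mod` gets the first shot at each sequence and `draft-mtp` only fills in where it produced nothing, whichever order the types were given in.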
I guess I’m getting old. I own two 16GB cards and I use them for models, for GPU passthrough for gaming, 3D model rendering, etc. 14-year-old me is mortified at this community.
I just wish we had a way to actually benchmark them properly though. Still seems no one has solved the problem of software architecture, brittleness and bloat as the codebase grows. Models love to add stuff, but they rarely clean up as they go. In a perfect world they'd do both near equally as they're developing.
It would be nice if there was an "architecture quality" benchmark that distilled the essence of what it means to have a good architecture, but I suppose that's an open research question with a lot of variables? Like how is good architecture actually quantified and measured? Is there a mechanism that can be re-used across all codebases to clearly denote one that is good and one that is bad, or is it highly subjective and dependent on the lens you're looking at it through? Is there a lot more to it than just "how much refactoring effort is required to extend this in the future?".
Surely this is something that has been well researched - yet I never really hear anything about it. Makes me wonder why.
Occam’s razor rings true here: where’s the money in it?
90 t/s for 27B Q8 256k context
260 t/s for 35B-A3B Q8 256k context
I tried coding with Pi and it was much faster than Claude, but on any task that wasn't straightforward it did so-so, either looping or not noticing easy-to-spot constraints.
But for exploring codebases and asking questions about big stuff I find it better due to sheer speed.
Though if you're buying an X570 board I'd go with the Crosshair VIII Dark Hero: no buzzy chipset fan, and it can also do 2x8.
Chances are the typical story goes founders start fully believing that they would succeed with their own innovation but slip down a gradient towards commodity provider without really noticing themselves.
Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI, and accept that the cost you'll pay per token is higher than it would be via any cloud. That's the price for privacy, control, and better quality via inference-time optimizations that otherwise aren't available.
Openrouter gives you access to whatever the inference provider gives. They're just the middleman. Many providers give logprobs if you ask, it's in their API. And yeah, no Peft or Lora, but that's an entirely different product. And some of the inference providers do that directly.
> Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI
But the whole point of openrouter is that you can run models by the token and you don't have to care about local AI? Sounds like you're more upset that people aren't making the same calculation on privacy and local control vs cost and ease of use.
In terms of electricity, if you aren't using it, even with all the VRAM loaded, at most you're wasting about 30 watts or so.
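As a rough sense of scale for that idle figure, here is the arithmetic as a sketch. The 30 W draw is from the comment above; the 0.40 per kWh price is purely a placeholder assumption, so plug in your own tariff.

```python
HOURS_PER_YEAR = 24 * 365


def idle_energy_kwh(idle_watts, hours):
    """Energy used at a constant idle draw, in kWh."""
    return idle_watts * hours / 1000.0


def idle_cost(idle_watts, hours, price_per_kwh):
    """Cost of that energy at a flat price per kWh."""
    return idle_energy_kwh(idle_watts, hours) * price_per_kwh


# 30 W is the figure from the comment; 0.40 per kWh is an assumed placeholder price.
yearly_kwh = idle_energy_kwh(30, HOURS_PER_YEAR)
yearly_cost = idle_cost(30, HOURS_PER_YEAR, 0.40)
print(f"{yearly_kwh:.1f} kWh/year, {yearly_cost:.2f} per year at the assumed price")
```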
Prompt processing a large uncached context is annoying, which is why I forced a lower context window, but I don't know if it's any worse in performance than the cloud models I've used.
There's a niceness, to me, knowing I don't have to rent it anymore. If you rent it, the terms can change regularly.
How would that change (improve) if you had two R9700s in a similar configuration?
The Qwen3.6-27B is unbearably slow as it doesn't fit in VRAM, though I think the MoE is very easy to run.
It is also extremely nice that you can just `apt install llama.cpp libggml0-backend-vulkan` now too.
Yesterday I downloaded Gemma4-26B with Ollama on a quite rusty desktop with a 1070 8GB, 32GB of RAM and a Core i5-9400.
I dropped in a photo of my water meter and told it to read the value and serial number. It was far from instant, but it was also easily under 3 minutes and the result was correct.
Earlier, around February, I tried the same photo with Gemma3 on the same hardware and the results were bad.
"Useful" as in "has a use that isn't just for show". It takes me two seconds to read a photo of a water meter. Having an LLM read it for me in 3 minutes isn't useful. Similarly small models are capable of tool use (e.g. web searches) but their synthesis leaves much to be desired. As an example I'd ask some small models to find examples of products with specific characteristics and they'd come back with only one or two because they discounted other possibilities incorrectly by reasoning themselves out of it.
> Feels like he was just comparing some benchmarks.
On what do you base this assertion?
Mostly use of this expression.
I don’t get the agent to read the meter for me; I can do that when I take the photo.
I send the photo to a bot that ingests photos from me and stores the readings with date and time, so later I can ask "what was the last reading?" or "what was the usage between dates x and y?", without me having to take a perfect photo and without me having to dabble with OpenCV.
Even if it takes 30mins it is still useful for me.
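A minimal sketch of what the storage side of such a bot could look like, assuming the model has already extracted the value and serial number from the photo. The table layout, function names and sample readings are all hypothetical; the commenter's actual setup isn't described.

```python
import sqlite3


def open_store():
    """In-memory store; a real bot would point this at a file on disk."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE readings (taken_at TEXT PRIMARY KEY, serial TEXT, value REAL)"
    )
    return db


def add_reading(db, taken_at, serial, value):
    """Store one reading; taken_at is an ISO 8601 string so text order is time order."""
    db.execute("INSERT INTO readings VALUES (?, ?, ?)", (taken_at, serial, value))


def last_reading(db):
    """Answer 'what was the last reading?'."""
    return db.execute(
        "SELECT taken_at, value FROM readings ORDER BY taken_at DESC LIMIT 1"
    ).fetchone()


def usage_between(db, start, end):
    """Answer 'what was the usage between x and y?' as last minus first reading."""
    rows = db.execute(
        "SELECT value FROM readings WHERE taken_at BETWEEN ? AND ? ORDER BY taken_at",
        (start, end),
    ).fetchall()
    if len(rows) < 2:
        return None  # not enough readings in the range to compute a difference
    return rows[-1][0] - rows[0][0]


# Hypothetical sample readings, as if extracted from three photos.
db = open_store()
add_reading(db, "2025-01-01T08:00", "SN-000", 100.0)
add_reading(db, "2025-02-01T08:00", "SN-000", 112.5)
add_reading(db, "2025-03-01T08:00", "SN-000", 120.0)
print(last_reading(db))
print(usage_between(db, "2025-01-01", "2025-02-28"))
```

With something like this behind it, the slow part (reading the photo) only happens once per photo, and the later questions are cheap lookups.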
Last year, some people were publishing about aider/ollama/openrouter [1], and now thankfully people are publishing all over about pi/qwen/llama.cpp/openrouter. It's widespread.
[1] https://alexhans.github.io/posts/aider-with-open-router.html
And noise.
I'm years past needing that strict diet, but the itch to fix or ease some parts of the process stayed.
do you mean by commanding a browser? or using APIs?
The smaller models, especially the aforementioned ones, fail much more often, but they don't write that kind of insane code. They do simple, non-clever coding like humans do, which is much easier to maintain and build upon.
Qwen-3.6-27b is a wonderful model. Exceptionally good for its size, and excellent in general as well. And with mtp available now, it can run at 60+ tps on a single 3090... this is roughly 30% faster tgs than most of the hosted ones being served from giant data-centers.
ollama launch claude --model
I would characterize it as doable, but not really viable. It's "yes you can do it but it's a lot slower", with a hint of "and the best local LLMs are on par with Haiku or maybe Sonnet, so larger and longer tasks get notably worse". Because if they don't imply that size is needed for every task, they'll end up tanking their valuations.
If you want something comparable you need the trillion parameter open models like deepseek.
At some point there's diminishing returns and your coding LLM performs worse because you encoded useless stuff like Pokemon combinations or languages you don't speak into its parameter space.
The "smartness" of the model comes from RLHF post-training, which is orthogonal to model size.
Also, if you're using an agentic harness a much better approach is to let the model control its own context. If you ever reach a point where your coding LLM needs to know about Pokemon, just give it a web search tool and let it google the Pokemons.
They are trained/reinforced far too heavily on solving problems rather than assisting you, which is something they have become extremely bad at.
It's hard to explain because I too had the many moments where "Fable5 / Opus4.8 xhigh could solve bugs/stuff that previous models couldn't", I know that to be true, and they are very useful for that.
But 90% of my tasks are quite mundane and I need thorough investigation and a proper assistant. Not a smart bullshitter fixated on solving the issue itself. On that Opus 4.6 has been the last good model.
Anything after that is completely skewed towards passing benchmarks and E2E tasks, but definitely not assisting.
Fable in particular was a disaster on that: non-stop thoroughness on whatever fix it had fixated on, writing nthousand experiments in /tmp, etc. Great model, not gonna lie, but only if your focus is vibe coding and you accept that you're nothing but an assistant and accept its shortcomings.
Do you get the speed of the 5080 with the memory of the 3090?
Can't wait until we just remove the language from the LLMs for accuracy and efficiency
The main issue is the immature software, and somewhat baroque way of writing kernels. Please, buy one and join us.
I’m not even using an 800G cable since they are expensive and I don’t think I need the bandwidth, opting for 400G instead. This just needs a config change for the number of Ethernet links it uses internally. (Apparently these cables are just many 200G links put together.)
My setup has a 4090 handling attention while the TT accelerators handle the MLP. With just a 4090 you can have the CPU handle the MLP layers and use a MoE model, assuming a sufficiently powerful CPU. I tried that setup with minimax 2.5 before, and was able to eke out around 10 to 15 tps (albeit with a 7965wx CPU).
Same here. There has to be someplace like this that's managed to cultivate a better crowd, but I'll be darned if I can find it.
HN really is a bit of an echo chamber. It's not that diverse opinions aren’t expressed, but folks who voice opinions that don’t align with a specific set of values aren’t very well received here.
I’ll also say that this place has shaped my values, including making me change my opinions on things I severely disagreed with at the time. I’ve also said a lot of shit on here I wish I could wipe out.
I feel the same here when "hackers" defend copyright maximalism, try to rehabilitate the Luddites, and argue that the Federal government should aggressively regulate AI models. Basically exhibiting both proud ignorance of history and reckless disregard for the future, all in one breath.
There are so many other places for that. So very, very many. Why do they come here? I spend time in those places as well, but I generally STFU when I have nothing to contribute, or when my core values conflict with their community charter.
I'd love to have multiple large VRAM GPUs but I can't justify the costs when I have plenty of other more important things to spend that money on.
2) I think it’s important to recognize that one of the things models are good for is astroturfing, and any given conversation you see may be direct or secondary effects of that (among other marketing).
ComputerGuru•1d ago
verdverm•1d ago
I've switched from using the spark as a way to run one model as best it can to running several support models for the md kb I'm working on
atq2119•1d ago
Memory bandwidth of RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the achieved bandwidth is only 720GB/s. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP.
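The estimate above can be reproduced in a few lines. This is a back-of-the-envelope sketch that assumes a dense model whose weights are all read once per generated token and ignores KV cache traffic; the 24 GB size is the commenter's worst-case assumption, not a measured figure.

```python
PEAK_BANDWIDTH_GBS = 936.0  # RTX 3090 figure quoted above
MODEL_SIZE_GB = 24.0        # assumed: weights fill the whole 24 GB card


def achieved_bandwidth_gbs(model_size_gb, tokens_per_second):
    """Every token reads all weights once, so bandwidth ~= size * tok/s."""
    return model_size_gb * tokens_per_second


def bandwidth_bound_tok_s(model_size_gb, peak_bandwidth_gbs):
    """Upper bound on tok/s if decoding were purely memory-bandwidth bound."""
    return peak_bandwidth_gbs / model_size_gb


achieved = achieved_bandwidth_gbs(MODEL_SIZE_GB, 30.0)
ceiling = bandwidth_bound_tok_s(MODEL_SIZE_GB, PEAK_BANDWIDTH_GBS)
utilization = achieved / PEAK_BANDWIDTH_GBS

print(f"achieved ~{achieved:.0f} GB/s, ceiling ~{ceiling:.0f} tok/s, "
      f"utilization ~{utilization:.0%}")
```

A smaller model file would lower the achieved bandwidth and raise the ceiling, so the headroom would be larger than this worst case suggests.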