RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

https://imil.net/blog/posts/2026/rtx-5080-+-rtx-3090-setup-80+-tok-s-on-qwen-3.6-27b-q8/

285•iMil•1d ago

Comments

ComputerGuru•1d ago

I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.

verdverm•1d ago

I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark, a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some on Discord/Reddit/et al, but less cohesive

I've switched from using the spark as a way to run one model as best it can to running several support models for the md kb I'm working on

atq2119•1d ago

Agreed. To put this in perspective, batch 1 token decode is bandwidth limited in theory.

Memory bandwidth of RTX 3090 is listed as 936GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24GB of that GPU, 30tok/s means the achieved bandwidth is only 720GB/s. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP.
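
The arithmetic behind that estimate can be sketched as follows. It assumes, as the comment does, that batch-1 decode reads every weight once per token and that the weights fill the full 24 GB; the function name is only illustrative.

```python
def achieved_bandwidth_gbs(model_size_gb: float, tokens_per_s: float) -> float:
    """Bandwidth implied by batch-1 decode if each token reads all weights once."""
    return model_size_gb * tokens_per_s


PEAK_3090_GBS = 936.0  # listed RTX 3090 memory bandwidth, from the comment

achieved = achieved_bandwidth_gbs(24.0, 30.0)  # 24 GB of weights at 30 tok/s
utilization = achieved / PEAK_3090_GBS
ceiling_tok_s = PEAK_3090_GBS / 24.0           # bandwidth-bound limit without MTP

print(f"achieved: {achieved:.0f} GB/s ({utilization:.0%} of peak)")
print(f"bandwidth-bound ceiling: {ceiling_tok_s:.0f} tok/s")
```

A smaller model file raises the ceiling proportionally, which is why the exact quantization used matters for this estimate.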

deng•1d ago

I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~3$ per 1M/tokens for that model on Openrouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...
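
A rough break-even sketch for this comparison, using the ~$3 per 1M tokens and ~$2k hardware figures from the comment, the 80 tok/s from the post title, and the ~600 W load figure reported elsewhere in the thread. The $0.30/kWh electricity price is purely an assumption, and the model ignores idle power, resale value, and any difference between input and output token pricing.

```python
def electricity_usd_per_mtok(watts: float, tokens_per_s: float,
                             usd_per_kwh: float) -> float:
    """Electricity cost of generating one million tokens at a steady rate."""
    hours = 1_000_000 / tokens_per_s / 3600
    return watts / 1000 * hours * usd_per_kwh


def breakeven_tokens(hardware_usd: float, api_usd_per_mtok: float,
                     local_usd_per_mtok: float = 0.0) -> float:
    """Tokens needed before the hardware purchase beats paying per token."""
    saving_per_mtok = api_usd_per_mtok - local_usd_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")  # local running cost alone exceeds the API price
    return hardware_usd / saving_per_mtok * 1_000_000


# usd_per_kwh=0.30 is an assumed price, not a figure from the thread.
power_cost = electricity_usd_per_mtok(watts=600, tokens_per_s=80, usd_per_kwh=0.30)
tokens = breakeven_tokens(2000, 3.0, power_cost)
days = tokens / 80 / 86_400

print(f"electricity: ${power_cost:.3f} per 1M tokens")
print(f"break-even: {tokens / 1e6:.0f}M tokens, about {days:.0f} days of continuous decode")
```

With a high enough electricity price the saving per million tokens goes to zero and the function returns infinity, which is the situation the California comment in this thread describes.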

TSiege•1d ago

It’s a personal hobby project. Why should we care how someone chooses to spend their free time and money? Lots of hobbies are expensive and pointless if you compare them with commercially available offerings. That’s why it’s a hobby and not a small business.

redfloatplane•1d ago

> I pay ~3$ per 1M/tokens for that model on Openrouter

I think the thing is, there's an unspoken "for now" at the end of that sentence and people running this locally are hedging against that "for now". Some people prefer to feel that they own the means rather than rent the means, even if the one they own is worse than the one they can rent. Especially with today's Fable news and the harsh realisation that the "for now" is dependent on very many unpredictable factors, where the one you have locally costs you capital today and a relatively predictable run-rate (made more predictable with on-prem solar for example), but should otherwise work predictably forever.

I'm not saying that you're wrong to do what you're doing, just that many people have their own lines in the sand where renting vs buying makes sense, and it doesn't only boil down to a rational (or irrational) financial decision.

jubilanti•1d ago

You're treating open weight inference providers the same as proprietary ones. They're fundamentally different business models. Proprietary companies have an incentive to subsidize actual inference and training costs in order to gain market share. The few dozen or so companies selling Qwen models by the token on openrouter are in a commodities market.

If suddenly the CCP declared a total digital embargo on Alibaba's Qwen models or even if for some reason all of mainland China (and Singapore) was completely unreachable from the rest of the world, the dozen or so companies selling Qwen by the token elsewhere in the world could continue business as usual.

avyeed_desa•1d ago

I just bought a $25 Chinese 2x Oculink card and two Minisforum DEG1, had some spare PSUs lying around, and just installed two cards on each. It works. I saw that there is also a 4x Oculink card, but I don't know if that will work, too.

atlgator•1d ago

Which "good quality PCIe 4 riser" did you buy?

iMil•1d ago

This one: https://es.aliexpress.com/item/1005010123289822.html?spm=a2g...

sieste•1d ago

That's almost exactly my setup and I'm very happy with its performance.

I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.

Both fail at different tasks, and Qwen more so than Claude.

But the way Qwen fails is much more straightforward. In writing tasks Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.

In coding tasks that Qwen can't solve it often just goes into a tool calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things just making more and more mess that takes forever to clean up.

I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers had similar experience?

eurekin•1d ago

I keep finding more and more use cases for Q3.6 27b (same league), and it performs best when the answers to my question are already in the context.

The moment I'm trying something open-ended or ambitious, Claude/ChatGPT clearly take you to the goal quicker.

For things, where there's a way to build a knowledgebase though, the local llm definitely can be a true contender. Plus, having a big context and no worries about filling it over and over - you can get quite far.

I'm writing this literally in between cooking a pasta for which the local llm ordered the products online. I've built a grocery shopping skill, so that it roughly knows what I have in the fridge (loosely), my last 10 representative orders (general preferences plus rich info about shops and SKUs around me) and actual real-time in-stock info. The last part has been my personal pet peeve for every product that promised cooking ingredient delivery (that is not packaged specifically for that).

This is what has been promised to us by every big tech company with an agent, and now a local llm has actually solved that for me fully.

matthewfcarlson•1d ago

I keep playing around with this exact concept. While I don’t always trust an entirely AI-generated recipe, more traditional setups are super rigid when it comes to ingredients.

ydj•1d ago

80tp/s with 5080 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps utilizing all three for qwen3.6 27b q8. Guess I got more optimization to do.

Would like to see the perf of their setup with and without mtp and ngram speculative decoding though, as well as parallel decode performance (once llamacpp mtp plays well with multiple slots).

Being in California electricity alone puts this non-competitive with just paying a cloud though.

manbart•1d ago

How is the software compatibility with the Tenstorrent cards? Are you stuck using vendor-supplied runtimes/models?

It's surprising how little these things come up given the price they go for

ydj•1d ago

The software stack is pretty immature, definitely very DIY. Their officially supported models are pretty old at this point, though there’s community support for gemma4, and support for models with GDN like qwen3.6 is supposedly very close.

The entire stack (minus some binary blobs in firmware) is open source, so if you have the time and persistence you can get whatever you want done.

A few community members have been working on support with llamacpp, where we can have supported operations offloaded to the TT cards, while having unsupported ops running on GPU or CPU. Llamacpp is pretty good at that. The existing kernels could definitely be better, and I’ll try my hand at writing some kernels some time.

arjie•1d ago

That’s the cost of using a new hardware provider. A single RTX Pro 6000 Blackwell Max-Q will do better than that and be much more usable. I have 2 running DS4 Flash at 160 tok/s with max num seqs 4.

Very interesting though, these Tenstorrent chips. Might get one to experiment with.

varispeed•1d ago

Could 2x RTX5080 work just as well?

iMil•1d ago

2xRTX5080 would be awesome. You'd only be able to run a q6, which is already pretty good, but moreover you'd be able to use P2P and run Blackwell at full speed, which I can't.

kcb•1d ago

With 2 Blackwells, would make sense to run NVFP4 quants

triwats•1d ago

Potential specs:

NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb

NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb

stared•1d ago

I really like Qwen 3.6 27B Q8.

On Apple Silicon, with MLX-LM, I am getting 20 tok/s with Macbook Max M5. Not sure how it compares to llama.cpp performance.

In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on laptop is wild. Though, it heats my laptop rapidly.

well_ackshually•1d ago

It does come with one tiny little issue: it now draws 700W on full load. Just a single 5080 is enough to measurably heat up a room when loaded (320W draw at the wall on mine), and with that amount of power flowing through, you better have a good PSU as well as checking your power plugs themselves, these are going to get HOT when your entire setup is basically drawing 1kW.

iMil•1d ago

I am actually surprised with the power draw, the box itself idles at 20W, which already amazes me for a Ryzen; when computing, I barely pass the 600W bar, and as I am not really using it to vibecode an entire system, I don't even notice the spikes on the power monitor (Shelly + homeassistant).

washadjeffmad•1d ago

I've got a 4090 and 3090 in a node that peaks at 600W.

If you're not power limiting in nvidia-smi, start.

cybertim•1d ago

I bought two 3080/20gb cards and one of those MACHINIST X99 mainboards (one with two full x16 PCIe slots). Those boards come with a Xeon CPU included (for the PCIe lane support). It set me back 800 euros total (I had a spare PSU, SSD and memory in a drawer), and now I'm also happily running Qwen 3.6 Q8 (MTP) at 80 tk/s.

iMil•1d ago

Good call, I really hesitated between the X570 and the X99, are you using P2P?

cybertim•1d ago

$ nvidia-smi topo -p2p r

GPU0 GPU1

GPU0 X CNS

GPU1 CNS X

i guess not, i use llama.cpp with:

--spec-draft-n-max 3 --spec-type draft-mtp --split-mode tensor --tensor-split 1,1

and my (gen) tk/s are between 60-80 tk/s

will test this uncensored model and ngram added as well this weekend

btw, i also set my power limit to 220 watts per card (with nvidia-smi). That will cost you around 1 tk/s but save you a LOT of power and heat :)
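
For anyone wanting to script the power cap, here is a minimal sketch that builds one `nvidia-smi -i <index> -pl <watts>` command per GPU. The 220 W value is the one from the comment; the helper names are hypothetical, and the sketch defaults to a dry run because changing the limit normally requires root and the accepted range depends on the card.

```python
import subprocess


def power_limit_cmd(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi invocation that caps one GPU's power draw."""
    if watts <= 0:
        raise ValueError("power limit must be a positive number of watts")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limits(limits: dict[int, int], dry_run: bool = True) -> list[list[str]]:
    """Return one command per GPU, and run them only when dry_run is False."""
    cmds = [power_limit_cmd(idx, watts) for idx, watts in sorted(limits.items())]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # raises if nvidia-smi rejects the value
    return cmds


# Dry run: print the commands for two cards capped at 220 W each.
for cmd in apply_power_limits({0: 220, 1: 220}):
    print(" ".join(cmd))
```

The limit set this way does not normally survive a reboot, so it is usually applied from a startup script.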

iMil•1d ago

CNS means Chipset not supported and I doubt it is the case, are you sure you are using the patched nvidia module? modinfo nvidia to check which one is loaded

cybertim•1d ago

I'm using bazzite on my ai-rig just because it has the gpu-optimized things set up (also nvidia-open). From what I can see, P2P seems to be available only for the 90-versions of the nvidia rtx gpu line, not 80, and some versions of 50xx? (apparently the 5080?). Anyways, i downloaded that uncensored model and tweaked those kv settings etc. Still getting 60-80tk/s, but i'm able to get my context to 180224 now; it used to be 131072, which gave me some trouble, so this is already a win :)

tonyrice•1d ago

If I had an eGPU right now, I'd 100% be using Qwen

skhameneh•1d ago

Would you mind giving these a try and let me know how they work for you? I’d imagine you would get better results and the latter will fit on a single GPU.

https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-EX...

https://huggingface.co/easiest-ai-shawn/Qwen3.6-27B-ExCal-Mi...

Do be sure to use dflash and/or mtp for the draft:

https://huggingface.co/turboderp/Qwen3.6-27B-MTP-exl3

https://huggingface.co/turboderp/Qwen3.6-27B-DFlash-exl3

DiabloD3•1d ago

The recommended values for Qwen 3.6 in thinking mode are `--temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00`, and `--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00` for coding/tool calling tasks, and for non-thinking, `--temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 --min-p 0.00`.

The options listed are none of these.

Also, the recommended Qwen MTP settings are `--spec-type draft-mtp --spec-draft-n-max 2`. 3 is not good on Nvidia hardware under different workloads. You can also add `ngram-mod`, but after `draft-mtp`; however, default `ngram-mod` settings aren't well tuned, and you want `--spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 16 --spec-ngram-mod-n-match 6` (defaults are 48, 64, 24; the ratio is good, the magnitude is suboptimal).

Of abliterated Qwen 3.6 27B models, huihui's ends up being the worst. Try heretic instead. https://huggingface.co/mradermacher/Qwen3.6-27B-uncensored-h...
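
The presets listed in this comment can be kept in one place and turned into an argument list, roughly as sketched below. The sampler values and the speculative-decoding flags are copied from the comment (with `--top-p` written with two dashes throughout) and have not been checked against any particular llama.cpp build; `model.gguf` and the preset names are placeholders.

```python
import shlex

# Sampler presets as quoted in the comment; not independently verified.
SAMPLER_PRESETS = {
    "thinking": {"--temp": 1.0, "--top-p": 0.95, "--top-k": 20, "--min-p": 0.0},
    "coding": {"--temp": 0.6, "--top-p": 0.95, "--top-k": 20, "--min-p": 0.0},
    "non-thinking": {"--temp": 0.7, "--top-p": 0.8, "--top-k": 20,
                     "--presence-penalty": 1.5, "--min-p": 0.0},
}

# MTP flags as quoted in the comment; treat them as an assumption about the build in use.
MTP_FLAGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]


def build_args(model_path: str, preset: str) -> list[str]:
    """Assemble a llama-server argument list for one named preset."""
    args = ["llama-server", "-m", model_path]
    for flag, value in SAMPLER_PRESETS[preset].items():
        args += [flag, str(value)]
    return args + MTP_FLAGS


print(shlex.join(build_args("model.gguf", "coding")))
```

Keeping the presets in a dictionary makes it easy to switch between thinking and coding runs without retyping the flags.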

aand16•21h ago

> You can also add `ngram-mod`, but after `draft-mtp`

It looks like there's a hardcoded preference, CLI order is not important.

(speculative.cpp:1322-1381): common_get_enabled_speculative_configs converts the types vector to a bitmask (order-independent). Then configs are added in a hardcoded priority order:

ngram-simple

ngram-map-k

ngram-map-k4v

ngram-mod

ngram-cache

draft-simple

draft-eagle3

draft-mtp

(speculative.cpp:1557-1603): common_speculative_draft iterates impls in the hardcoded priority order. Once an impl produces a draft for a sequence, later impls skip that sequence.
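
The behaviour described here can be illustrated with a toy model. This is not the llama.cpp code, only a sketch of the two ideas: CLI order collapses into a bitmask, and the first implementation in the fixed priority list that produces a draft wins. The two drafter functions are hypothetical stand-ins.

```python
PRIORITY = ["ngram-simple", "ngram-map-k", "ngram-map-k4v", "ngram-mod",
            "ngram-cache", "draft-simple", "draft-eagle3", "draft-mtp"]
BIT = {name: 1 << i for i, name in enumerate(PRIORITY)}


def enabled_in_priority_order(cli_types):
    """Collapse CLI order into a bitmask, then expand in the fixed priority order."""
    mask = 0
    for name in cli_types:
        mask |= BIT[name]
    return [name for name in PRIORITY if mask & BIT[name]]


def draft(seq, impls, drafters):
    """Return (impl_name, tokens) from the first impl that yields a draft."""
    for name in impls:
        tokens = drafters[name](seq)
        if tokens:
            return name, tokens  # later impls are skipped for this sequence
    return None, []


def toy_ngram(seq):
    # Hypothetical: propose the token that followed the last token earlier on.
    if not seq:
        return []
    for i in range(len(seq) - 2, -1, -1):
        if seq[i] == seq[-1]:
            return [seq[i + 1]]
    return []


def toy_mtp(seq):
    # Hypothetical: a model-based drafter that always proposes something.
    return [0]


drafters = {"ngram-mod": toy_ngram, "draft-mtp": toy_mtp}
impls = enabled_in_priority_order(["draft-mtp", "ngram-mod"])
print(impls)
print(draft([1, 2, 3, 1], impls, drafters))
print(draft([1, 2, 3], impls, drafters))
```

In this toy version the n-gram drafter is tried first regardless of how the types were listed, and the MTP drafter only runs when the n-gram lookup finds nothing.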

DiabloD3•10h ago

Interesting.

WeylandDarkStar•1d ago

Sits in silence, watching China as they innovated a new type of ultra-thin gpu board and calling it 5090 "Turbos." Still waiting for Shenzhen listings to post a 5090 official verified with VBIOS crack...

neals•1d ago

I tried implementing qwen through openrouter and deepinfra. Even without thinking, I had to wait 60s+ for the full result, where haiku or flash would be done in 5 or 6 seconds.

irishcoffee•1d ago

It is absolutely mind blowing to see some of the responses here. Open source, run-your-own, pay for nothing, we’re-all-nerds-that-buy-the-hardware-anyways ethos seems basically dead.

I guess I’m getting old. I own two 16gb cards and I use them for models, for gpu-pasthru for gaming, 3d model rendering, etc. 14 year old me is mortified at this community.

nullbio•1d ago

Times are changing. The open-weight models have needed time to catch up, but they're finally at a point now where we can get almost frontier level capabilities for coding.

I just wish we had a way to actually benchmark them properly though. Still seems no one has solved the problem of software architecture, brittleness and bloat as the codebase grows. Models love to add stuff, but they rarely clean up as they go. In a perfect world they'd do both near equally as they're developing.

It would be nice if there was an "architecture quality" benchmark that distilled the essence of what it means to have a good architecture, but I suppose that's an open research question with a lot of variables? Like how is good architecture actually quantified and measured? Is there a mechanism that can be re-used across all codebases to clearly denote one that is good and one that is bad, or is it highly subjective and depend on the lens you're looking at it from? Is there a lot more to it than just "how much refactoring effort is required to extend this in the future?".

Surely this is something that has been well researched - yet I never really hear anything about it. Makes me wonder why.

irishcoffee•1d ago

> Surely this is something that has been well researched - yet I never really hear anything about it. Makes me wonder why.

Occam’s razor rings true here: where’s the money in it?

CamperBob2•1d ago

mirekrusin•1d ago

on 2x 4090:

90 t/s for 27B Q8 256k context

260 t/s for 35B-A3B Q8 256k context

tomekowal•1d ago

With qwen3.6-35b-a3b-mtp using lm-studio on RTX 3090, I was getting 120tokens/s. The mtp (multi token prediction) is the key.

I tried coding with Pi and it was much faster than Claude, but for any not-straightforward tasks, it did so-so. Either looping itself or not realising easy-to-spot constraints.

But for exploring codebases and asking questions about big stuff I find it better due to sheer speed.

Havoc•20h ago

Very nice!

Though if you're buying a X570 board I'd do crosshair viii dark hero - no buzzy chipset fan and can also do 2x8

redfloatplane•1d ago

I was thinking of user-side regulations as well, not only provider-side ones. I could imagine a world where a government rules that you may not use LLMs for anything, which would be much easier to get around if you have local means.

bee_rider•1d ago

I don’t know anything about the open weight host business model. Do we know for certain that the folks selling inference by the token are really selling them in an upfront and profitable way? No subsidies from harvesting the info, to sell to the model trainers or anything like that?

usrusr•1d ago

Or subsidies from hopeful investors sweet-talked into not understanding the commodity nature of the business they are investing in. But that does not change much about the general assessment.

Chances are the typical story goes founders start fully believing that they would succeed with their own innovation but slip down a gradient towards commodity provider without really noticing themselves.

Der_Einzige•1d ago

Openrouter doesn't give you access to the models internals, i.e. complete control of logprobs, sampler stack, any PeFTs.

Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI and accept that the cost you'll pay for tokens is higher than you will when consumed via any cloud. That's the price for privacy, control, and better quality via inference time optimizations that otherwise aren't available.

jubilanti•1d ago

> Openrouter doesn't give you access to the models internals, i.e. complete control of logprobs, sampler stack, any PeFTs.

Openrouter gives you access to whatever the inference provider gives. They're just the middleman. Many providers give logprobs if you ask, it's in their API. And yeah, no Peft or Lora, but that's an entirely different product. And some of the inference providers do that directly.

> Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI

But the whole point of openrouter is that you can run models by the token and you don't have to care about local AI? Sounds like you're more upset that people aren't making the same calculation on privacy and local control vs cost and ease of use.

NicoJuicy•1d ago

Rtx 3090 24 gb set me back 390€ a year ago ( 2nd hand)

rirze•1d ago

Was it still in good condition? That price makes me wonder if it was used for crypto mining, which can wear down the hardware.

gsora•1d ago

Any sane crypto miner undervolted and underclocked their GPUs for efficiency's sake; if anything, they went through less wear than, say, regular gaming.

toyg•1d ago

Yeah but they can also be used to play games and do other stuff.

ThunderSizzle•1d ago

An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window (with room to spare) with a bit of fine tuning llamacpp-vulkan, but llamacpp's repository instability and lack of real versioning frustrates me.

In terms of electricity, if you aren't using it, even with all the vram loaded, at most you're wasting about 30 watts or so.

Prompt processing a large uncached context is annoying, which is why I forced a lower context window, but I don't know if it's any worse in performance than the cloud models I've used.

There's a niceness, to me, knowing I don't have to rent it anymore. If you rent it, the terms can change regularly.

bertili•1d ago

Qwen 27b is a compute heavy dense model.

rsync•1d ago

"An R9700 is $1350 and can get 100 TPS running Qwen3.6-35B-A3B Q5 with 130k context window ..."

How would that change (improve) if you had two R9700 in a similar configuration ?

vardalab•1d ago

better prompt processing like 1.5x+ and more kv but tg most likely lower like 0.8x or so but I am just going by memory for Qwen3.5 without mtp.

medfield•1d ago

I use local models to explore, hosted models to refine. I somewhat envy those who can sustain local models (q8 120b+) running as a hobby.... for me, the practical path is a better SearXNG setup and knowing my routes forward.

PeterStuer•1d ago

When they declare open models a 'security risk', his setup will be running, yours will not and even that 3090 will be way outside of your reach.

amelius•1d ago

You are paying with your privacy ...

alexjplant•1d ago

I've spent the past week trying to scheme a way to get affordable local inference of something useful (Qwen3.6-35B-A3B) for ~$500 and have come to the conclusion that it simply isn't viable. A pair of power-restricted P100s in a workstation gets close but the workstations themselves are expensive and rare as hen's teeth (not to mention loud and large). I think early '27 will be when things open up as the hardware market unclenches and further strides are made in small capable models.

mappu•1d ago

I'm running Qwen3.6-35B-A3B on a very ordinary desktop PC (32GB DDR5, 8GB Radeon 6600XT) and getting a useful 15-20 tok/sec out of it. The MoE architecture and auto offloading from system to VRAM is just fantastic. Unsloth Q4_K_XL.

The Qwen3.6-27B is unbearably slow as it doesn't fit in VRAM, though, i think the MoE is very easy to run.

It is also extremely nice that you can just `apt install llama.cpp libggml0-backend-vulkan` now too.
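
A back-of-the-envelope sketch of why the MoE is so much easier to run than the dense 27B on this kind of machine. It reads "A3B" as roughly 3B active parameters per token and uses a nominal 4 bits per weight for both models; both are assumptions, and real quant formats such as Q4_K_XL carry some extra overhead.

```python
def weight_bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Approximate weight bytes that must be read to decode one token."""
    return active_params * bits_per_weight / 8


dense_27b = weight_bytes_per_token(27e9, 4)  # dense: all 27B parameters are active
moe_a3b = weight_bytes_per_token(3e9, 4)     # MoE: roughly 3B active per token (assumed)

print(f"27B dense: {dense_27b / 1e9:.1f} GB read per token")
print(f"35B-A3B:   {moe_a3b / 1e9:.1f} GB read per token")
print(f"ratio:     {dense_27b / moe_a3b:.0f}x")
```

Under these assumptions the dense model's weights alone exceed an 8 GB card, so a large share of every token's reads goes through system RAM, while the MoE touches only a fraction of that per token.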

ozim•1d ago

I wonder what parent poster means with „useful” and what he actually tried? Feels like he was just comparing some benchmarks.

Yesterday I downloaded Gemma4-26B with Ollama on quite rusty desktop with 1070 8gb and 32gb of ram and Core i5-9400.

I drop photo of my water meter and tell it to read the value and serial number. It was far from instant but it was also easily under 3 minutes and result was correct.

Earlier like in February I was trying the same photo with Gemma3 on the same hardware and results were bad.

alexjplant•15h ago

> I drop photo of my water meter and tell it to read the value and serial number. It was far from instant but it was also easily under 3 minutes and result was correct.

"Useful" as in "has a use that isn't just for show". It takes me two seconds to read a photo of a water meter. Having an LLM read it for me in 3 minutes isn't useful. Similarly small models are capable of tool use (e.g. web searches) but their synthesis leaves much to be desired. As an example I'd ask some small models to find examples of products with specific characteristics and they'd come back with only one or two because they discounted other possibilities incorrectly by reasoning themselves out of it.

> Feels like he was just comparing some benchmarks.

On what do you base this assertion?

ozim•11h ago

> trying to scheme a way

Mostly use of this expression.

I don’t get agent to read the meter for me - I can do that when I take the photo.

I send the photo to a bot that ingests photos from me and stores readings for me with date and time, so later I can ask „what was the last reading” or „what was the usage between x and y dates”, without me having to make a perfect photo, and without me having to dabble with OpenCV.

Even if it takes 30mins it is still useful for me.
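
The storage side of such a bot can be very small. The sketch below is not the commenter's implementation; the class names and the sample readings are made up for illustration, and the vision model that extracts the value from the photo is out of scope.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Reading:
    taken_at: datetime
    value: float  # meter value as extracted from the photo


class MeterLog:
    def __init__(self) -> None:
        self._readings: list[Reading] = []

    def add(self, taken_at: datetime, value: float) -> None:
        self._readings.append(Reading(taken_at, value))
        self._readings.sort(key=lambda r: r.taken_at)  # keep chronological order

    def last(self) -> Optional[Reading]:
        return self._readings[-1] if self._readings else None

    def usage_between(self, start: datetime, end: datetime) -> Optional[float]:
        """Difference between the last and first readings inside [start, end]."""
        window = [r for r in self._readings if start <= r.taken_at <= end]
        if len(window) < 2:
            return None  # not enough readings in the window to compute usage
        return window[-1].value - window[0].value


# Made-up sample readings, added out of order on purpose.
log = MeterLog()
log.add(datetime(2026, 2, 1), 112.5)
log.add(datetime(2026, 1, 1), 100.0)
log.add(datetime(2026, 3, 1), 120.0)

print(log.last().value)
print(log.usage_between(datetime(2026, 1, 1), datetime(2026, 2, 28)))
```

Because the readings are timestamped when stored, a slow model only delays when an answer becomes available, not its correctness.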

alexhans•1d ago

I think it's important to be able to do both so you can stay in control of the price to value created relationship.

Last year, some people were publishing about aider/ollama/open router [1], and now thankfully people are publishing all around about pi/qwen/llama.cpp/openrouter. It's widespread.

[1] https://alexhans.github.io/posts/aider-with-open-router.html

pier25•1d ago

> not to mention the electricity to run them...

And noise.

sixothree•1d ago

You also aren't limited to LLMS. Vision, whisper, etc. You can even have claude farm out tasks to your local servers.

eurekin•1d ago

I kept getting recipes with "that one ingredient", which was either a major PITA to source or produced too much waste, even from a real world dietician consultation. Example: use 1/4th of a pumpkin for something. Those were good recipes in terms of macronutrient composition, but they don't work long term due to logistics.

I'm years past needing that strict diet, but the itch to fix or ease some parts of the process stayed.

ed_mercer•1d ago

>the local llm ordered products for me online

do you mean by commanding a browser? or using APIs?

eurekin•1d ago

Chrome driven by the OS accessibility API

porridgeraisin•1d ago

This is true. The failure modes are simpler. And yes, the ceiling is lower as well. Smaller models' stability is lower over long sequences, and thus anything that needs a lot of CoT will be weaker. For example, I had a dumb lock + condvar with multiple defenses against lost wakeups in an N-producer, 1-consumer queue thing. Models generally need a lot of CoT before they realise they can switch it to a semaphore instead. Qwen typically isn't stable over such long CoTs and ends up adding more and more slop and band-aids, versus a larger model that outputs a large CoT and then realises it can swap 3 functions out with 2 lines if we use a semaphore.
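
For readers unfamiliar with the refactor being described, here is a minimal sketch of a semaphore-based N-producer, 1-consumer queue. It is not the commenter's code, and the class and function names are made up. The point is that the semaphore's count persists, so a release that happens before the consumer waits is not lost.

```python
import threading
from collections import deque


class SemaphoreQueue:
    """Unbounded N-producer / 1-consumer queue built on a counting semaphore."""

    def __init__(self) -> None:
        self._items = deque()
        self._lock = threading.Lock()             # protects the deque
        self._available = threading.Semaphore(0)  # counts queued items

    def put(self, item) -> None:
        with self._lock:
            self._items.append(item)
        self._available.release()  # the count is remembered even if nobody is waiting

    def get(self):
        self._available.acquire()  # blocks until at least one item is queued
        with self._lock:
            return self._items.popleft()


def demo(n_producers: int = 4, per_producer: int = 100) -> int:
    q = SemaphoreQueue()

    def produce(p: int) -> None:
        for i in range(per_producer):
            q.put((p, i))

    producers = [threading.Thread(target=produce, args=(p,)) for p in range(n_producers)]
    for t in producers:
        t.start()
    received = [q.get() for _ in range(n_producers * per_producer)]
    for t in producers:
        t.join()
    return len(set(received))  # number of distinct items the consumer saw


print(demo())
```

Compared with a lock plus condition variable, there is no predicate loop or wakeup bookkeeping to get wrong, which is the simplification the comment is pointing at.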

christkv•1d ago

It's also going to fail consistently. When calling Claude you don't know what version of the model you are talking to; it might be quantized due to load or have been patched.

freakynit•1d ago

I have said this before as well: these top-of-the-line models write clever, convoluted code. The code looks intelligent from above, but is a maintenance headache. Makes entire thing fragile for future developments on top of it.

The smaller models, especially the aforementioned ones, they fail much more, but, do not write that insanity of the code. They do simple, non-clever coding like humans do. Much easier to maintain and build upon.

Qwen-3.6-27b is a wonderful model. Exceptionally good for its size, and excellent in general as well. And with mtp available now, it can run at 60+ tps on a single 3090... this is roughly 30% faster tgs than most of the hosted ones being served from giant data-centers.

hamburglar•1d ago

Not having a lot of experience with this, I ask a naive question: is there a world where you can take your local LLM and hook it up to Claude and get more Claude-like results from your local model? Obviously, there are going to be material differences in how these perform, but are we getting close to a place where this is viable? I imagine that the answers are a combination of “not yet” and “yes but it’s a lot slower” and “yes but there is actually little point to doing this because ‘what Claude gets you’ is highly baked into anthropic’s models and that’s part of what you’re paying for.”

petu•1d ago

You're kinda talking about Claude being used for planning/architect role, while local LLM is just executing it (performing edits) -- at least in such form it's a thing, yes.

girvo•1d ago

I have a "task router" that is a small local LLM on my mac mini (Qwen 3.5 0.8B) that I use to decide (when activated) with Pi whether to route a given task to my local LLM (Step 3.7 Flash) or to <given cloud provider>, if that counts? It works surprisingly well really. Though some of the cloud providers are getting so good and so cheap (GLM 5.1/5.2, MiniMax M3, among others) that the need to use my local one becomes less and less relevant, depressingly!

z3t4•1d ago

opencode is like Claude code, but you can use any model.

znnajdla•1d ago

Already been done. Look at the Forge project for local LLMs. It can bring 8b models up to Opus-like performance at tool calling.

datadrivenangel•1d ago

You can use ollama as the backend for claude code!

  ollama launch claude --model

I would characterize it as doable, but not really viable. It's "yes you can do it but it's a lot slower", with a hint of "and the best local LLMs are on par with Haiku or maybe Sonnet so larger and longer tasks get notably worse".

trueno•1d ago

i keep seeing people talk about pi harnesses. whats this about?

eyeris•1d ago

It’s one of the hot new-ish harnesses. Believe it’s like openclaw or Claude code without all of the defaults

https://pi.dev/

nullbio•1d ago

I know the big labs like to pretend that their models are trillion parameter. But how likely is that really to be the case when Qwen 3.6 35B A3B gets so close to their performance? Seems that with the best research applied, best training data, they'd be able to top the charts with a 60B model quite easily.

MisterKent•1d ago

They want people to believe they have massive models, that is effectively their moat at this point.

Because if they don't imply that size is needed for every task, they'll end up tanking their valuations.

https://blog.nilesh.io/post/ai-profit-race

redox99•1d ago

Qwen 35B isn't even remotely close to the big models. It's just people over hyping small models. Ignore the benchmarks they are almost meaningless.

If you want something comparable you need the trillion parameter open models like deepseek.

otabdeveloper4•19h ago

Number of parameters doesn't make the model smarter, it just makes it know more stuff out of the box.

At some point there's diminishing returns and your coding LLM performs worse because you encoded useless stuff like Pokemon combinations or languages you don't speak into its parameter space.

The "smartness" of the model comes from RLHF post-training, which is orthogonal to model size.

Also, if you're using an agentic harness a much better approach is to let the model control its own context. If you ever reach a point where your coding LLM needs to know about Pokemon, just give it a web search tool and let it google the Pokemons.

redox99•19h ago

That's just... not true. Just compare any open model which is trained with the same recipe but multiple sizes.

oneshtein•28m ago

You can compare models at OpenRouter site. Qwen 3.6 dense is in top 24% for coding.

iamanllm•1d ago

Frontier models are still better (everyone would use them if it was cheap). Open source models are capable on even non "simple" problems but I trust them less, even though I usually write plans for all changes, and they are worse at debugging. I recently converted my homelab to nixos and let's just say Deepseek failed and Fable did great (the night before getting killed)

epolanski•23h ago

While what you say is in general true, every model that followed Opus 4.6 on Anthropic's side has been increasingly worse at what the previous user points out: they are extremely smart and can convince the user of major falsehoods.

They are way too trained/reinforced on solving problems rather than assisting you, something they have become extremely bad at.

It's hard to explain because I too had the many moments where "Fable5 / Opus4.8 xhigh could solve bugs/stuff that previous models couldn't", I know that to be true, and they are very useful for that.

But 90% of my tasks are quite mundane and I need thorough investigation and a proper assistant. Not a smart bullshitter fixated on solving the issue itself. On that Opus 4.6 has been the last good model.

Anything after that is completely skewed towards passing benchmarks and E2E tasks, but definitely not assisting.

Fable in particular was a disaster on that, non stop being thorough on the fix it fixated on, writing nthousand experiments in /tmp, etc. Great model, not gonna lie, but only if your focus is vibe coding and you accept that you're nothing but an assistant and accept its shortcomings.

iamanllm•20h ago

yeah, the "proactivity" of recent anthropic models and sophisticated bullshitting are bad, although my experience is that even on simple tasks i've never used a oss model that has consistently been better in terms of the quality of the result.

ericb•23h ago

Do the two cards "share" their memory pool? Can work still be split across it? I'm wondering how it would do with something like fine tuning?

Do you get the speed of the 5080 with the memory of the 3090?

hnthrow0287345•20h ago

>In writing tasks Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.

Can't wait until we just remove the language from the LLMs for accuracy and efficiency

sieste•16h ago

Imagine how accurate you could be if you could circumvent the coding agent and just type the source code directly into an editor all by yourself o_O Like a write_file skill but for humans!!!

ydj•1d ago

Yeah that’s definitely the smarter buy if you want to just have models running quickly. But the cost of two P150s and a 4090 was <$5000 for me.

The main issue is the immature software, and somewhat baroque way of writing kernels. Please, buy one and join us.

arjie•18h ago

Were you able to connect the two P150 using the qsfp-dd cable? They only sell 4x and 8x topologies so I’m curious if that worked for you. Are you able to run them tensor parallel?

ydj•7h ago

Yeah, I’m doing TP with two cards. The topology is configured based on yaml files, and if you are not using a predefined config you can just create a new config with your topology.

I’m not even using an 800G cable since they are expensive and I don’t think I need the bandwidth, opting for 400G instead. This just needs a config change for the number of Ethernet links it uses internally. (Apparently these cables are just many 200G links put together.)

arjie•5h ago

Brilliant, thank you. Maybe I'll get a couple in a bit.

ricardobeat•1d ago

I get 28tps for Qwen3.6 27B on a Ryzen AI Max+ 395, with enough spare memory to run another two small models on the side. 60tps for 35B. Am surprised this is not more common.

shepherdjerred•1d ago

Do you get anything useful out of your 4090 (I have one too)? Local cloud sounds like a fun idea but I just don’t see how it competes against OpenAI/Anthropic

ydj•1d ago

I think it’s not really worth it compared to just buying tokens or a coding plan.

My setup has the 4090 handling attention while the TT accelerators handle the MLP. With just a 4090 you can have the CPU handle the MLP layers and use a MoE model, assuming a sufficiently powerful CPU. I tried that setup with minimax 2.5 before, and was able to eke out around 10 to 15 tps (albeit with a 7965WX CPU).
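
A rough way to sanity-check a figure like the 10 to 15 tps above is the bandwidth bound for batch-1 decode: each generated token has to read the active weights once, so throughput is capped at memory bandwidth divided by bytes read per token. The sketch below is only that back-of-the-envelope estimate. The function name `decode_tokens_per_second`, the `efficiency` factor, and the example numbers are assumptions for illustration, not measurements from this setup.

```python
def decode_tokens_per_second(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bytes_per_param: float,
    efficiency: float = 1.0,
) -> float:
    """Bandwidth-bound estimate of batch-1 decode speed.

    Assumes every generated token reads all *active* weights exactly once
    and that nothing else (KV cache reads, PCIe transfers, compute) limits
    throughput. `efficiency` is a hypothetical fudge factor in (0, 1] for
    the fraction of peak bandwidth actually achieved.
    """
    if bandwidth_gb_s <= 0 or active_params_billion <= 0 or bytes_per_param <= 0:
        raise ValueError("bandwidth, parameter count and bytes/param must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")

    # 1e9 params * bytes/param = GB read per token, so the units cancel to tok/s.
    gb_read_per_token = active_params_billion * bytes_per_param
    return efficiency * bandwidth_gb_s / gb_read_per_token


# Illustrative numbers only: 10B active parameters at 1 byte each (8-bit),
# served from memory with 200 GB/s of usable bandwidth.
ideal = decode_tokens_per_second(200.0, 10.0, 1.0)          # 20 tok/s ceiling
realistic = decode_tokens_per_second(200.0, 10.0, 1.0, 0.6)  # ceiling scaled by 60%
```

For a MoE model only the experts routed to per token count as active, which is why offloading the MLP layers to a CPU with many memory channels can still be usable.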

14 year old me is mortified at this community.

Same here. There has to be someplace like this that's managed to cultivate a better crowd, but I'll be darned if I can find it.

irishcoffee•13h ago

This place is probably the best you’ll find. I actually found HN from a site called meta-somethingOrOther, a long time ago, and that site is probably the closest I can think of.

HN really is a bit of an echo chamber, not that diverse opinions aren’t expressed, but the folks that voice opinions that don’t align with a specific set of values aren’t very well received here.

I’ll also say that this place has shaped my values, including making me change my opinions on things I severely disagreed with at the time. I’ve also said a lot of shit on here I wish I could wipe out.

CamperBob2•13h ago

I'm fine with diverse opinions, as long as they're not too diverse... and yes, there is such a thing as "too diverse." If I were to barge into an Amish town meeting and harangue them about how they should be using Qwen 3.6 27B Q8 to plan their crop rotation schedule, I would soon find myself heading out of town facing south on a northbound mule. And that's OK.

I feel the same here when "hackers" defend copyright maximalism, try to rehabilitate the Luddites, and argue that the Federal government should aggressively regulate AI models. Basically exhibiting both proud ignorance of history and reckless disregard for the future, all in one breath.

There are so many other places for that. So very, very many. Why do they come here? I spend time in those places as well, but I generally STFU when I have nothing to contribute, or when my core values conflict with their community charter.

driverdan•16h ago

The problem is accessibility. GPUs have gotten expensive and in some cases hard to get. You also need the supporting hardware to use them.

I'd love to have multiple large VRAM GPUs but I can't justify the costs when I have plenty of other more important things to spend that money on.

wwweston•16h ago

1) Different people might optimize for different things. There are people calculating that expensive hardware plus cheap rentals means owning isn’t optimizing, but there are people making choices that fit your preferences too.

2) I think it’s important to recognize that one of the things models are good for is astroturfing, and any given conversation you see may be direct or secondary effects of that (among other marketing).
