RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

https://imil.net/blog/posts/2026/rtx-5080-+-rtx-3090-setup-80+-tok-s-on-qwen-3.6-27b-q8/

66•iMil•6h ago

Comments

ComputerGuru•1h ago

I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc., instead of what’s essentially just a recipe.

verdverm•1h ago

I've been using https://spark-arena.com/leaderboard to glean this kind of information for DGX Spark, a sort of recipe book. The Nvidia forum has people talking about the things you wish to know. I see some on Discord/Reddit/et al., but it's less cohesive.

I've switched from using the Spark as a way to run one model as best it can to running several support models for the md kb I'm working on.

atq2119•28m ago

Agreed. To put this in perspective, batch-1 token decode is bandwidth-limited in theory.

Memory bandwidth of the RTX 3090 is listed as 936 GB/s. The post isn't fully clear on which model they used and how big it is, but even assuming it perfectly filled the 24 GB of that GPU, 30 tok/s means the achieved bandwidth is only 720 GB/s, about 77% of the listed peak. There's a bunch of room for improvement here even without MTP, and those improvements should largely stack with MTP.
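The arithmetic above can be sketched in a few lines. This is a rough estimate only: it assumes a dense model whose weights are each read once per generated token, and it ignores KV-cache reads and any inter-GPU traffic. The 24 GB model size is the comment's own upper-bound assumption, and the function names are made up for illustration.

```python
# Back-of-the-envelope model for bandwidth-limited batch-1 decode.
# Assumption: every weight byte is read exactly once per generated token.

PEAK_BANDWIDTH_GBS = 936.0   # RTX 3090 listed memory bandwidth
MODEL_SIZE_GB = 24.0         # assumed: model exactly fills the 3090's VRAM
OBSERVED_TOK_S = 30.0        # decode speed discussed in the comment


def achieved_bandwidth_gbs(model_size_gb: float, tokens_per_s: float) -> float:
    """Bytes streamed per second if the whole model is read once per token."""
    return model_size_gb * tokens_per_s


def decode_ceiling_tok_s(model_size_gb: float, peak_bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s when decode is purely bandwidth-limited."""
    return peak_bandwidth_gbs / model_size_gb


achieved = achieved_bandwidth_gbs(MODEL_SIZE_GB, OBSERVED_TOK_S)
ceiling = decode_ceiling_tok_s(MODEL_SIZE_GB, PEAK_BANDWIDTH_GBS)
utilization = achieved / PEAK_BANDWIDTH_GBS

print(f"achieved bandwidth: {achieved:.0f} GB/s")
print(f"theoretical ceiling: {ceiling:.1f} tok/s")
print(f"bandwidth utilization: {utilization:.0%}")
```

A smaller model would move the ceiling up proportionally, so these figures only hold under the 24 GB assumption.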

deng•57m ago

I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~$3 per 1M tokens for that model on Openrouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...
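For a sense of scale, here is a hedged break-even sketch of that comparison. It assumes $2,000 as a lower bound for "well over 2k", takes the ~$3 per 1M tokens figure at face value, borrows the 80 tok/s from the post title, and leaves electricity out entirely, so the real break-even point would be further away. The function names are illustrative.

```python
# Rough rent-vs-buy break-even, ignoring electricity and resale value.

HARDWARE_COST_USD = 2_000.0      # assumed lower bound for "well over 2k"
API_PRICE_PER_MILLION_USD = 3.0  # price quoted in the comment
LOCAL_TOK_S = 80.0               # decode speed from the post title
SECONDS_PER_DAY = 86_400


def break_even_tokens(hardware_cost_usd: float, price_per_million_usd: float) -> float:
    """Tokens you would have to buy from the API to spend the hardware cost."""
    return hardware_cost_usd / price_per_million_usd * 1_000_000


def days_to_generate(tokens: float, tokens_per_s: float) -> float:
    """Days of continuous single-stream generation needed for that many tokens."""
    return tokens / tokens_per_s / SECONDS_PER_DAY


tokens = break_even_tokens(HARDWARE_COST_USD, API_PRICE_PER_MILLION_USD)
days = days_to_generate(tokens, LOCAL_TOK_S)

print(f"break-even volume: {tokens / 1e6:.0f}M tokens")
print(f"continuous generation at {LOCAL_TOK_S:.0f} tok/s: {days:.0f} days")
```

Under these assumptions the hardware pays for itself only after several hundred million generated tokens, which is the financial side of the argument; it says nothing about privacy or control.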

TSiege•53m ago

It’s a personal hobby project. Why should we care how someone chooses to spend their free time and money? Lots of hobbies are expensive and pointless if you compare them with commercially available offerings. That’s why it’s a hobby and not a small business.

redfloatplane•49m ago

> I pay ~$3 per 1M tokens for that model on Openrouter

I think the thing is, there's an unspoken "for now" at the end of that sentence and people running this locally are hedging against that "for now". Some people prefer to feel that they own the means rather than rent the means, even if the one they own is worse than the one they can rent. Especially with today's Fable news and the harsh realisation that the "for now" is dependent on very many unpredictable factors, where the one you have locally costs you capital today and a relatively predictable run-rate (made more predictable with on-prem solar for example), but should otherwise work predictably forever.

I'm not saying that you're wrong to do what you're doing, just that many people have their own lines in the sand where renting vs buying makes sense, and it doesn't only boil down to a rational (or irrational) financial decision.

jubilanti•15m ago

You're treating open weight inference providers the same as proprietary ones. They're fundamentally different business models. Proprietary companies have an incentive to subsidize actual inference and training costs in order to gain market share. The few dozen or so companies selling Qwen models by the token on openrouter are in a commodities market.

If suddenly the CCP declared a total digital embargo on Alibaba's Qwen models or even if for some reason all of mainland China (and Singapore) was completely unreachable from the rest of the world, the dozen or so companies selling Qwen by the token elsewhere in the world could continue business as usual.

avyeed_desa•57m ago

I just bought a $25 Chinese 2x Oculink card and two Minisforum DEG1 docks, had some spare PSUs lying around, and just installed two cards on each. It works. I saw that there is also a 4x Oculink card, but I don't know if that will work, too.

atlgator•45m ago

Which "good quality PCIe 4 riser" did you buy?

iMil•19m ago

This one: https://es.aliexpress.com/item/1005010123289822.html?spm=a2g...

sieste•36m ago

That's almost exactly my setup and I'm very happy with its performance.

I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.

Both fail at different tasks, and Qwen more so than Claude.

But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.

In coding tasks that Qwen can't solve it often just goes into a tool calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things just making more and more mess that takes forever to clean up.

I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had a similar experience?

ydj•6m ago

80 tps with a 5080 + 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps utilizing all three for Qwen3.6 27B Q8. Guess I've got more optimization to do.

Would like to see the perf of their setup with and without MTP and n-gram speculative decoding though, as well as parallel decode performance (once llama.cpp MTP plays well with multiple slots).

Der_Einzige•41m ago

Openrouter doesn't give you access to the model's internals, i.e. complete control of logprobs, sampler stack, any PEFTs.

Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI and accept that the cost you'll pay for tokens is higher than what you'd pay via any cloud. That's the price for privacy, control, and better quality via inference-time optimizations that otherwise aren't available.

jubilanti•8m ago

> Openrouter doesn't give you access to the model's internals, i.e. complete control of logprobs, sampler stack, any PEFTs.

Openrouter gives you access to whatever the inference provider gives. They're just the middleman. Many providers give logprobs if you ask; it's in their API. And yeah, no PEFT or LoRA, but that's an entirely different product. And some of the inference providers do that directly.

> Openrouter fking sucks and I don't know why people here act like it's so great. Stop using it if you care about local AI

But the whole point of openrouter is that you can run models by the token and you don't have to care about local AI? Sounds like you're more upset that people aren't making the same calculation on privacy and local control vs cost and ease of use.

