frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Qwen3.5-35B – 16GB GPU – 100T/s with 120K context AND vision enabled

https://github.com/willbnu/Qwen-3.5-16G-Vram-Local
2•willfinger•3h ago

Comments

willfinger•3h ago
After weeks of systematic benchmarking, I've cracked the optimal configuration for Qwen3.5-35B-A3B on consumer 16GB GPUs.

The headline: *120 t/s generation, ~500 t/s prompt ingestion, 120K context, vision enabled — all on a single 16GB card.*

---

## The Vision Breakthrough

Here's what makes this special: most local LLM setups sacrifice speed when you enable multimodal. Not this one.

You get:

- Image analysis - PDF reading - Screenshot understanding - Chart/diagram interpretation

All at 120 t/s. The mmproj adds ~0.9 GB VRAM overhead but zero speed penalty.

This is genuinely useful for coding workflows — paste a screenshot of an error, a diagram of an architecture, or a PDF spec, and the model understands it at full inference speed.

---

## The Token Limit Discovery

There's a hard performance cliff at exactly 155,904 tokens:

| Context | Speed | | ------- | ------- | | 155,904 | 125 t/s | | 156,160 | 9 t/s |

256 more tokens = 10× slowdown.

This is NOT a VRAM issue. The model fits at 192K and 256K too. It's a CUDA_Host compute buffer alignment boundary (~312.5 MB) that saturates PCIe bandwidth on this hybrid MoE architecture.

*For Windows users:* I recommend 120K context (122,880 tokens). This gives ~1GB VRAM headroom for the OS, with only 4% speed loss vs the theoretical max.

---

## Critical Flag: --parallel 1

This is mandatory for the 35B-A3B model:

- Default: `--parallel auto` (4 slots) → 9 t/s - Fixed: `--parallel 1` → 120 t/s

The GDN hybrid architecture allocates recurrent state buffers per parallel slot. 4 slots = 4× buffers = 10× slower.

---

## The Optimal Config

``` -m Qwen3.5-35B-A3B-Q3_K_S.gguf --mmproj mmproj-35B-F16.gguf -c 122880 -ngl 99 --flash-attn on -ctk iq4_nl -ctv iq4_nl --parallel 1 --reasoning-budget 0 --temp 0.6 --top-p 0.95 --top-k 20 ```

Results:

- ~120 t/s generation - ~500 t/s prompt ingestion - 120K tokens context (155K theoretical max) - Vision working at full speed - ~15.4 GB VRAM, all 41 layers on GPU

---

## Why "35B" Is Faster Than 27B

Mixture-of-Experts: 256 experts, only 8 routed + 1 shared activate per token. Effectively computes ~3B parameters per forward pass.

That's why a "35B" model at 14.2 GB runs 3.4× faster than a dense 27B.

---

## What I Built

Complete drop-in repo:

- Three server profiles: coding (35B), vision (9B), quality (27B) - Windows & Linux launchers — one command - Python benchmark suite + pytest coverage - React dashboard with live inference metrics - Vision test scripts - SM120 native build included for RTX 5080/5090 - Full technical writeup

https://github.com/willbnu/Qwen-3.5-16G-Vram-Local

---

## Compatibility

Tested on RTX 5080 16GB. Works on any NVIDIA 16GB:

- RTX 4080: ~90 t/s - RTX 4070 Ti Super: ~80 t/s - RTX 4060 Ti 16GB: ~65 t/s - RTX 3060 Ti 16GB: ~55 t/s

The 155,904 token cliff is architecture-dependent, not GPU-specific.

---

Hardware: RTX 5080 16GB, Ryzen 7 9800X3D, 96GB DDR5 Software: llama.cpp (SM120 native build)

```

johndough•3h ago

    > Works on: RTX 3060 Ti 16GB
This seems to be a hallucination. There is no RTX 3060 or RTX 3060 Ti GPU with 16GB memory.

How College Admissions Officers Spot Over-Coached Applications

https://www.forbes.com/sites/christopherrim/2026/02/27/how-college-admissions-officers-spot-over-...
1•paulpauper•56s ago•0 comments

Our Hospice System Subverts the Point of Hospice Care

https://www.nytimes.com/2026/03/02/opinion/hospice-care.html
1•paulpauper•1m ago•0 comments

SEIU Delenda Est

https://www.astralcodexten.com/p/seiu-delenda-est
1•paulpauper•3m ago•0 comments

Tell HN: Azure Data Factory pipeline execution delays in East US 2

1•dwoldrich•3m ago•0 comments

Show HN: ByeBrief – a local-first AI investigation canvas

https://github.com/byte271/ByeBrief/
1•yihac1•5m ago•0 comments

The Differentiated Engineer in the Era of Automated Development

https://substack.com/home/post/p-190017259
1•Carsten_Peters•5m ago•0 comments

Defense Devaluation – Starlink on American Drones

https://en.topwar.ru/278903-devalvacija-oborony-starlink-na-amerikanskih-dronah.html
1•B1FF_PSUVM•5m ago•0 comments

India Plans 30% Slash in Thermal Coal Imports This Year

https://oilprice.com/Latest-Energy-News/World-News/India-Plans-30-Slash-in-Thermal-Coal-Imports-T...
1•PaulHoule•5m ago•0 comments

I made a programming language with M&Ms

https://mufeedvh.com/posts/i-made-a-programming-language-with-mnms/
1•mufeedvh•6m ago•0 comments

Show HN: MysteryMaker AI

https://www.mysterymaker.ai
1•jhappy77•8m ago•0 comments

Peer-to-Peer Networking: Build a VPN Tunnel with Wintun on Windows – Part 2

https://www.0xmm.in/posts/peer-to-peer-windows-part2/
1•melson•15m ago•0 comments

UUID package coming to Go standard library

https://github.com/golang/go/issues/62026
2•soypat•16m ago•0 comments

US draws up strict new AI guidelines amid Anthropic clash

https://www.reuters.com/business/media-telecom/us-draws-up-strict-new-ai-guidelines-amid-anthropi...
4•ericsaf•17m ago•1 comments

T3 Code – a new OSS agentic coding app that wraps Codex

https://t3.codes/
1•theobr•17m ago•0 comments

Show HN: HyperClaw – self-hosted AI assistant that replies on Telegram/Discord/+

https://github.com/mylo-2001/hyperclaw
1•mylw•18m ago•0 comments

Rust 1.94.0

https://blog.rust-lang.org/2026/03/05/Rust-1.94.0/
1•tahazsh•23m ago•0 comments

Natural Language AutoCoder Open SOurce

https://github.com/tommoirnz/NLAutocoder
1•moirnz•27m ago•1 comments

Show HN: Claude-consensus – Multi-model code review plugin for Claude Code

https://github.com/AltimateAI/claude-consensus
1•k_git•27m ago•0 comments

BYD unveils Blade Battery 2.0: 10-70% in 5 mins, 10-97% in 9 mins

https://carnewschina.com/2026/03/05/byd-unveils-blade-battery-2-0-10-70-in-5-mins-10-97-in-9-mins...
1•xxfye•28m ago•0 comments

Show HN: Copyworks – Chinese character worksheets with tone colors

https://copyworks.loqu8.com
1•loqu8•32m ago•0 comments

Saulala

https://www.saulala.com/
2•matthberg•33m ago•0 comments

Qatar warns war will force Gulf to stop energy exports 'within days'

https://www.ft.com/content/be122b17-e667-478d-be19-89d605e978ea
4•geox•38m ago•0 comments

FASTEST LLM decode engine on Apple Silicon. 658 tok/s on M4-Max,beats MLX by 19%

https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-engine-apple-silicon
2•sanchitmonga•39m ago•2 comments

T3 Code: A Minimal Web GUI/Desktop App for Coding Agents

https://github.com/pingdotgg/t3code
1•vldszn•40m ago•0 comments

I built a database of verified YouTube channel revenues

https://ytmrr.com/
1•poissac•41m ago•1 comments

Cancellation of Army exercise fuels speculation about Mideast troop deployments

https://www.washingtonpost.com/national-security/2026/03/06/army-82nd-airborne-iran/
5•ParentiSoundSys•47m ago•0 comments

ClawMarket agent skill – gives agents wallets and ability to sign onchain txns

https://clawmarket.tech
1•semanticlayer•47m ago•1 comments

Teams have a context-sharing problem; TeamContext is our attempt

https://github.com/hzhou9/TeamContext
1•hzhou9•48m ago•1 comments

AIs are not conscious, but most critics can't adequately explain why

https://plus.flux.community/p/its-like-this-why-your-perception
2•Novapebble•50m ago•5 comments

Show HN: Wez, modern terminal web browser with Vim bindings

https://github.com/keyle/wez
1•keyle•52m ago•0 comments