Qwen3.5-35B – 16GB GPU – 100T/s with 120K context AND vision enabled

https://github.com/willbnu/Qwen-3.5-16G-Vram-Local

2•willfinger•3h ago

Comments

willfinger•3h ago

After weeks of systematic benchmarking, I've cracked the optimal configuration for Qwen3.5-35B-A3B on consumer 16GB GPUs.

The headline: *120 t/s generation, ~500 t/s prompt ingestion, 120K context, vision enabled — all on a single 16GB card.*

---

## The Vision Breakthrough

Here's what makes this special: most local LLM setups sacrifice speed when you enable multimodal. Not this one.

You get:

- Image analysis - PDF reading - Screenshot understanding - Chart/diagram interpretation

All at 120 t/s. The mmproj adds ~0.9 GB VRAM overhead but zero speed penalty.

This is genuinely useful for coding workflows — paste a screenshot of an error, a diagram of an architecture, or a PDF spec, and the model understands it at full inference speed.

---

## The Token Limit Discovery

There's a hard performance cliff at exactly 155,904 tokens:

| Context | Speed | | ------- | ------- | | 155,904 | 125 t/s | | 156,160 | 9 t/s |

256 more tokens = 10× slowdown.

This is NOT a VRAM issue. The model fits at 192K and 256K too. It's a CUDA_Host compute buffer alignment boundary (~312.5 MB) that saturates PCIe bandwidth on this hybrid MoE architecture.

*For Windows users:* I recommend 120K context (122,880 tokens). This gives ~1GB VRAM headroom for the OS, with only 4% speed loss vs the theoretical max.

---

## Critical Flag: --parallel 1

This is mandatory for the 35B-A3B model:

- Default: `--parallel auto` (4 slots) → 9 t/s - Fixed: `--parallel 1` → 120 t/s

The GDN hybrid architecture allocates recurrent state buffers per parallel slot. 4 slots = 4× buffers = 10× slower.

---

## The Optimal Config

``` -m Qwen3.5-35B-A3B-Q3_K_S.gguf --mmproj mmproj-35B-F16.gguf -c 122880 -ngl 99 --flash-attn on -ctk iq4_nl -ctv iq4_nl --parallel 1 --reasoning-budget 0 --temp 0.6 --top-p 0.95 --top-k 20 ```

Results:

- ~120 t/s generation - ~500 t/s prompt ingestion - 120K tokens context (155K theoretical max) - Vision working at full speed - ~15.4 GB VRAM, all 41 layers on GPU

---

## Why "35B" Is Faster Than 27B

Mixture-of-Experts: 256 experts, only 8 routed + 1 shared activate per token. Effectively computes ~3B parameters per forward pass.

That's why a "35B" model at 14.2 GB runs 3.4× faster than a dense 27B.

---

## What I Built

Complete drop-in repo:

- Three server profiles: coding (35B), vision (9B), quality (27B) - Windows & Linux launchers — one command - Python benchmark suite + pytest coverage - React dashboard with live inference metrics - Vision test scripts - SM120 native build included for RTX 5080/5090 - Full technical writeup

https://github.com/willbnu/Qwen-3.5-16G-Vram-Local

---

## Compatibility

Tested on RTX 5080 16GB. Works on any NVIDIA 16GB:

- RTX 4080: ~90 t/s - RTX 4070 Ti Super: ~80 t/s - RTX 4060 Ti 16GB: ~65 t/s - RTX 3060 Ti 16GB: ~55 t/s

The 155,904 token cliff is architecture-dependent, not GPU-specific.

---

Hardware: RTX 5080 16GB, Ryzen 7 9800X3D, 96GB DDR5 Software: llama.cpp (SM120 native build)

```

johndough•3h ago

    > Works on: RTX 3060 Ti 16GB

This seems to be a hallucination. There is no RTX 3060 or RTX 3060 Ti GPU with 16GB memory.

How College Admissions Officers Spot Over-Coached Applications

Our Hospice System Subverts the Point of Hospice Care

SEIU Delenda Est

Tell HN: Azure Data Factory pipeline execution delays in East US 2

Show HN: ByeBrief – a local-first AI investigation canvas

The Differentiated Engineer in the Era of Automated Development

Defense Devaluation – Starlink on American Drones

India Plans 30% Slash in Thermal Coal Imports This Year

I made a programming language with M&Ms

Show HN: MysteryMaker AI

Peer-to-Peer Networking: Build a VPN Tunnel with Wintun on Windows – Part 2

UUID package coming to Go standard library

US draws up strict new AI guidelines amid Anthropic clash

T3 Code – a new OSS agentic coding app that wraps Codex

Show HN: HyperClaw – self-hosted AI assistant that replies on Telegram/Discord/+

Rust 1.94.0

Natural Language AutoCoder Open SOurce

Show HN: Claude-consensus – Multi-model code review plugin for Claude Code

BYD unveils Blade Battery 2.0: 10-70% in 5 mins, 10-97% in 9 mins

Show HN: Copyworks – Chinese character worksheets with tone colors

Saulala

Qatar warns war will force Gulf to stop energy exports 'within days'

FASTEST LLM decode engine on Apple Silicon. 658 tok/s on M4-Max,beats MLX by 19%

T3 Code: A Minimal Web GUI/Desktop App for Coding Agents

I built a database of verified YouTube channel revenues

Cancellation of Army exercise fuels speculation about Mideast troop deployments

ClawMarket agent skill – gives agents wallets and ability to sign onchain txns

Teams have a context-sharing problem; TeamContext is our attempt

AIs are not conscious, but most critics can't adequately explain why

Show HN: Wez, modern terminal web browser with Vim bindings