We built a standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft.
207.6 tok/s peak (5.46x over AR); the HE 10-prompt bench averages 129.5 tok/s at DDTree budget=22 on a single RTX 3090 (24 GB). That's 3.43x over autoregressive and 2.8x over the best public SGLang AWQ number.
TL;DR
- Peak 207.6 tok/s DFlash vs 38.0 tok/s AR (5.46x). HE bench: 129.5 tok/s mean at DDTree budget=22.
- 3.43x over autoregressive Q4_K_M baseline (37.78 tok/s).
- 2.8x vs SGLang AWQ reference (46.6 tok/s) on the same RTX 3090.
- 128K context fits on 24 GB. Q4_0 KV + rolling 4096-slot target feature buffer. 134.78 tok/s at ctx=131072.
- ggml only; libllama is never linked. ~2000 LOC of C++/CUDA in libdflash27b.a around ggml_gated_delta_net.
Why the experiment exists
Qwen3.5-27B is a hybrid model: every 4th layer is full softmax attention, the rest (48 of 64) are Gated DeltaNet, so there's an SSM state cache alongside the KV cache. That combo doesn't have a good single-3090 decode path today: llama.cpp has the GGUF loader and ggml_gated_delta_net, but no DFlash speculative decoding. vLLM / SGLang ship z-lab's DFlash integration, but only for BF16 (54 GB, which doesn't fit on 24 GB). The AWQ target on SGLang runs plain AR at 46.6 tok/s but can't host a BF16 draft + DDTree state in 24 GB. z-lab's reference benchmarks run BF16 on B200-class hardware (54+ GB). We wanted the fastest single-3090 decode on a 24 GB card. The answer: port only the graph glue to ggml, keep the existing DeltaNet kernel, run a DFlash block-diffusion draft with a DDTree verifier, and compress KV to Q4_0 for long context.
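The dual cache falls straight out of that layer pattern. A toy sketch of the split, assuming the attention layers sit at every 4th index (the exact offset here is illustrative, not taken from the model config):

    #include <cstdio>

    // Hypothetical illustration of the hybrid layout described above:
    // attention layers need a KV cache, Gated DeltaNet layers need a
    // fixed-size recurrent (SSM) state instead.
    enum class LayerKind { Attention, GatedDeltaNet };

    LayerKind layer_kind(int il) {
        return (il % 4 == 3) ? LayerKind::Attention : LayerKind::GatedDeltaNet;
    }

    int main() {
        int n_attn = 0, n_delta = 0;
        for (int il = 0; il < 64; ++il)
            (layer_kind(il) == LayerKind::Attention ? n_attn : n_delta)++;
        // prints: attention layers: 16, DeltaNet layers: 48
        std::printf("attention layers: %d, DeltaNet layers: %d\n", n_attn, n_delta);
        return 0;
    }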
From autoregressive to DDTree
Same 10-prompt HE bench, n_gen=256, Q4_K_M target, BF16 draft (AL = average accept length). The DDTree paper reports +35-42% over chain DFlash on pure-attention Qwen3 variants; on our hybrid Q4_K_M/RTX 3090 combo we see +15% over chain. The gap comes from Q4 quantization flattening the draft softmax, partially patched with a chain pre-seed in build_ddtree (sketched below). We're draft-ceiling bound, not verify-memory bound: a bigger tree won't help, only a better draft will.
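Roughly, the pre-seed guarantees the tree can never do worse than chain DFlash: spend part of the budget on the greedy chain first, then spend the rest on sibling branches. A minimal standalone sketch with hypothetical types (not how build_ddtree is actually structured; the real thing works on draft logits inside the graph):

    #include <cstddef>
    #include <vector>

    struct TreeNode {
        int token;
        int parent;   // index of the parent node in the tree, -1 for the root's parent
    };

    // greedy_chain: the draft's argmax continuation, one token per position.
    // branch_candidates[i]: remaining top-k tokens at position i, best first.
    std::vector<TreeNode> build_tree_with_chain_preseed(
            const std::vector<int> &greedy_chain,
            const std::vector<std::vector<int>> &branch_candidates,
            size_t budget) {
        std::vector<TreeNode> tree;

        // 1. Pre-seed: the plain greedy chain goes in first, so even a flat
        //    draft softmax can't make the tree worse than chain DFlash.
        for (size_t i = 0; i < greedy_chain.size() && tree.size() < budget; ++i)
            tree.push_back({greedy_chain[i], (int) tree.size() - 1});

        // 2. Spend the remaining budget on sibling branches, shallow positions first.
        const size_t chain_len = tree.size();
        for (size_t i = 0; i < branch_candidates.size() && tree.size() < budget; ++i) {
            if (i > chain_len) break;       // no chain node to hang this branch off
            int parent = (int) i - 1;       // siblings at position i share the chain node's parent
            for (int tok : branch_candidates[i]) {
                if (tree.size() >= budget) break;
                tree.push_back({tok, parent});
            }
        }
        return tree;
    }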
Key wins
- f16 intermediate cache: half the bandwidth, +5% at the same tree budget. Output stays bit-identical to AR over a 40-token check.
- Persist-write kernel (ggml_gated_delta_net_tree_persist): skips a 9 ms ggml_cpy per step, +11%.
- target_feat compaction after sibling accept: unlocked real tree rescue on 9/10 prompts.
- extract_draft_topk reverse bug: sort_heap + cmp_greater already produces descending order; an extra std::reverse was sending the worst candidate to the tree root. One-line fix (see the sketch after this list).
- verify_logits_buf overflow: the buffer was sized vocab * q_len, but DDTree reads vocab * (budget+1) once budget goes past 15. Silent memory corruption. One-line size fix.
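To make the topk ordering bug concrete, here is a tiny standalone reproduction of the ordering fact (not the actual extract_draft_topk code): a heap built with std::greater is a min-heap, and sort_heap on it leaves the range sorted best-first, so the extra reverse flips the worst candidate to index 0, i.e. the tree root.

    #include <algorithm>
    #include <cassert>
    #include <functional>
    #include <vector>

    int main() {
        std::vector<float> logits = {0.1f, 2.5f, 1.7f, 0.3f};

        // min-heap via std::greater, then sort_heap: result is descending by value.
        std::make_heap(logits.begin(), logits.end(), std::greater<float>());
        std::sort_heap(logits.begin(), logits.end(), std::greater<float>());
        assert(logits.front() == 2.5f);   // best candidate already at index 0

        // The bug: one extra reverse would put 0.1f (the worst) at index 0.
        // std::reverse(logits.begin(), logits.end());

        return 0;
    }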
128K context on 24 GB
Flash attention in ggml-cuda supports Q4_0 K+V natively, so KV compression is just ggml_cpy with the F32->Q4_0 quantizer on write. 8x over f16. Combined with a rolling 4096-slot target_feat ring, target_feat shrinks from 6.6 GB to 0.2 GB at 128K. Tradeoff: Q4_0 KV costs ~3% quality on HE at short context (AL 8.56 -> 8.33), and the trade only gets more favorable at long context. It's the only thing that lets 128K fit on 24 GB.
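Back-of-envelope on the ring, derived from the numbers above, plus the slot addressing you'd expect (the modulo scheme here is illustrative, not lifted from the code):

    #include <cstdint>
    #include <cstdio>

    int main() {
        const double full_gb    = 6.6;       // full target_feat buffer at 128K
        const int    n_ctx      = 131072;
        const int    ring_slots = 4096;

        // ~50 KB per position, so 4096 slots land around 0.2 GB.
        const double bytes_per_slot = full_gb * 1e9 / n_ctx;
        const double ring_gb        = bytes_per_slot * ring_slots / 1e9;
        std::printf("per-slot: %.1f KB, ring: %.2f GB\n", bytes_per_slot / 1e3, ring_gb);

        // Rolling ring: position pos lands in slot pos % ring_slots, so the draft
        // only ever sees the most recent 4096 target features.
        const int64_t pos  = 100000;
        const int64_t slot = pos % ring_slots;
        std::printf("pos %lld -> slot %lld\n", (long long) pos, (long long) slot);
        return 0;
    }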
Prefill
Short prompts (<=2048 tok): PREFILL_UBATCH=16, matching the DFlash block size. Long prompts (>2048 tok): auto-bump to PREFILL_UBATCH=192. A 13K-token prefill drops from 40.9 s to 15.07 s (2.7x, ~913 tok/s).
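The heuristic is a one-liner; the constants and threshold are the ones above, the wrapper name is hypothetical:

    // Sketch of the prefill micro-batch pick described above.
    static int pick_prefill_ubatch(int n_prompt_tokens) {
        const int PREFILL_UBATCH_SHORT = 16;   // matches the DFlash block size
        const int PREFILL_UBATCH_LONG  = 192;  // auto-bump for long prompts
        return n_prompt_tokens <= 2048 ? PREFILL_UBATCH_SHORT : PREFILL_UBATCH_LONG;
    }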
What comes next
- Daemon mode: keep the model resident, first-token latency 10 s -> ms.
- Temperature / top-k sampling in verify. Currently greedy-only.
- Q5_K_M / Q6_K: better quants should recover most of the ~30-point accept gap vs BF16.
- Full llama.cpp integration: qwen35 arch, llama-speculative-dflash.cpp wiring.
- Metal/Vulkan: not planned. CUDA only, anyone who wants Metal can fork.
As soon as Qwen3.6-27B comes out, we'll do the same for it. Repo in the first comment (open source, MIT).
causal•9m ago
Cool. If I understand correctly though, the single kernel only works on a single GPU, right? No parallelism support to go Q8 on 2x3090?