
We got 207 tok/s with Qwen3.5-27B on an RTX 3090

https://github.com/Luce-Org/lucebox-hub
22•GreenGames•1h ago

Comments

GreenGames•1h ago
We built a standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft.

Peak 207.6 tok/s (5.46x over AR); the 10-prompt HE bench averages 129.5 tok/s at DDTree budget=22 on a single RTX 3090 (24 GB). That's 3.43x over autoregressive and 2.8x over the best public SGLang AWQ number.

TL;DR

- Peak 207.6 tok/s DFlash vs 38.0 tok/s AR (5.46x). HE bench: 129.5 tok/s mean at DDTree budget=22.
- 3.43x over the autoregressive Q4_K_M baseline (37.78 tok/s).
- 2.8x vs the SGLang AWQ reference (46.6 tok/s) on the same RTX 3090.
- 128K context fits on 24 GB: Q4_0 KV + a rolling 4096-slot target feature buffer. 134.78 tok/s at ctx=131072.
- ggml only, never linking libllama: ~2000 LOC of C++/CUDA in libdflash27b.a around ggml_gated_delta_net.

Why the experiment exists

Qwen3.5-27B is a hybrid model: every 4th layer is full softmax attention; the rest (48 of 64) are Gated DeltaNet, which keeps an SSM state cache alongside the KV cache. That combination has no good single-3090 decode path today:

- llama.cpp has the GGUF loader and ggml_gated_delta_net, but no DFlash speculative decoding.
- vLLM / SGLang ship z-lab's DFlash integration, but only in BF16 (54 GB, doesn't fit in 24 GB).
- The AWQ target on SGLang runs plain AR at 46.6 tok/s but can't host a BF16 draft + DDTree state in 24 GB.
- z-lab's reference benchmarks run BF16 on B200-class hardware (54+ GB).

We wanted the fastest single-3090 decode on a 24 GB card. The answer: port only the graph glue to ggml, keep the existing DeltaNet kernel, run a DFlash block-diffusion draft with a DDTree verifier, and compress KV to Q4_0 for long context.

From autoregressive to DDTree

Same 10-prompt HE bench, n_gen=256, Q4_K_M target, BF16 draft. AL = average accept length. The DDTree paper reports +35-42% over chain DFlash on pure-attention Qwen3 variants; on our hybrid Q4_K_M / RTX 3090 combo we see +15% over chain. The gap comes from Q4 quantization flattening the draft softmax, partially patched with a chain pre-seed in build_ddtree. We are draft-ceiling bound, not verify-memory bound: a bigger tree won't help, only a better draft will.

Key wins

- f16 intermediate cache: half the bandwidth, +5% at the same tree budget. Bit-identical to AR at 40 tokens.
- Persist-write kernel (ggml_gated_delta_net_tree_persist): skips a 9 ms ggml_cpy per step, +11%.
- target_feat compaction after sibling accept: unlocked real tree rescue on 9/10 prompts.
- extract_draft_topk reverse bug: sort_heap + cmp_greater already produces descending order; an extra std::reverse was sending the worst candidate to the tree root. One-line fix.
- verify_logits_buf overflow: sized vocab * q_len, but DDTree reads vocab * (budget+1) past budget 15. Silent memory corruption. One-line size fix.

128K context on 24 GB

Flash attention in ggml-cuda supports Q4_0 K+V natively, so KV compression is just a ggml_cpy with the F32->Q4_0 quantizer on write: 8x over f16. Combined with a rolling 4096-slot target_feat ring, target_feat shrinks from 6.6 GB to 0.2 GB at 128K. Tradeoff: Q4_0 KV costs ~3% quality on HE (AL 8.56 -> 8.33) at short context, and does dramatically better at long ones. It's the only thing that lets 128K fit in 24 GB.

Prefill

Short prompts (<=2048 tok) use PREFILL_UBATCH=16, matching the DFlash block size. Long prompts (>2048 tok) auto-bump to PREFILL_UBATCH=192. On a 13K-token prefill: 40.9 s -> 15.07 s (2.7x, ~913 tok/s).

What comes next

- Daemon mode: keep the model resident, first-token latency 10 s -> ms.
- Temperature / top-k sampling in verify. Currently greedy-only.
- Q5_K_M / Q6_K: better quants should recover most of the ~30-point accept gap vs BF16.
- Full llama.cpp integration: qwen35 arch, llama-speculative-dflash.cpp wiring.
- Metal/Vulkan: not planned. CUDA only; anyone who wants Metal can fork.

As soon as Qwen3.6-27B comes out, we'll do the same for it. Repo in the first comment (open source, MIT).

causal•9m ago
Cool. If I understand correctly though, the single-kernel path only works on a single GPU, right? No parallelism support to go Q8 on 2x3090?

Visibility, approvals, and auditability for multi-agent coding workflows

https://beta.actower.io/blog/visibility-approvals-auditability-multi-agent-workflows
1•gokhanozer•53s ago•0 comments

Show HN: Simple CLI tool to convert PDFs to dark mode, with TOC preservation

https://github.com/rngil/dark-pdf
1•rngil•54s ago•0 comments

BookShelves – Modern eBook reader and library manager for macOS and iOS

https://getbookshelves.app
1•janandonly•1m ago•0 comments

Openheim – open-source LLM agent in Rust (CLI, REPL, or HTTP server)

https://openheim.io
1•themartto•1m ago•0 comments

ClickHouse Native JSON Support in 2026: A PR-by-PR Analysis

https://dataanalyticsguide.substack.com/p/clickhouse-native-json-support-2026
1•manveerc•1m ago•0 comments

Spam – A Software PAckage Manager Utility

https://codeberg.org/aol/spam
1•iris-digital•3m ago•0 comments

What will be scarce? – by Alex Imas – Ghosts of Electricity

https://aleximas.substack.com/p/what-will-be-scarce
1•bilsbie•3m ago•0 comments

Opt-In Isn't a Guardrail

https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a...
1•azhenley•3m ago•0 comments

AI Resistance Is Growing

https://stephvee.ca/blog/artificial%20intelligence/ai-resistance-is-growing/
1•speckx•3m ago•0 comments

Agentic AI as a Part of Software Development

https://nemorize.com/roadmaps/agentic-ai-as-a-part-of-software-development
1•reverseblade2•5m ago•0 comments

Package Cooldown with SBOMs

https://www.interlynk.io/resources/cooldowns-with-sboms
2•surendrapathak•5m ago•0 comments

Trending projects from the GithubAwesome YouTube channel

https://mcowger.github.io/gha/
1•indigodaddy•6m ago•0 comments

More than 50% of young Dutch adults do not want children

https://nltimes.nl/2026/04/20/50-young-dutch-adults-want-children
3•randycupertino•6m ago•0 comments

How to make a video look much smoother, without increasing the file size?

https://www.seriousaboutech.com/2023/03/how-to-make-video-look-much-smoother.html
2•janandonly•9m ago•0 comments

Phones to be banned in schools by law in England under government plans

https://www.bbc.co.uk/news/articles/c5y7vd6gpq1o
2•mmarian•10m ago•0 comments

Over 200 Japanese firms have paid ransomware attackers; 60% fail to recover data

https://japantoday.com/category/crime/over-200-japanese-firms-paid-ransomware-attackers-60-fail-t...
1•xoxxala•10m ago•0 comments

IPv4, IPv6, and a sudden change in attitude (2020)

https://tailscale.com/blog/two-internets-both-flakey
1•frizlab•10m ago•0 comments

ASI-Evolve: AI Accelerates AI

https://arxiv.org/abs/2603.29640
1•Mars008•12m ago•0 comments

Git 2.54.0 Released

https://lwn.net/Articles/1068703/
2•kazu11max17•15m ago•0 comments

F-35 is a masterpiece built for the wrong war

https://warontherocks.com/cogs-of-war/the-f-35-is-a-masterpiece-built-for-the-wrong-war/
2•anjel•15m ago•0 comments

Waves and Particles

https://taylor.town/waves
3•birdculture•17m ago•0 comments

The Missing Bundler Features

https://byroot.github.io/ruby/bundler/2026/04/20/bundle-features.html
1•weaksauce•18m ago•0 comments

Show HN: Explain The Law – Simplified legislation and executive orders using AI

https://explainthelaw.com/
1•Nortey•18m ago•0 comments

OpenAI's Chronicles is basically what I open-sourced last week, to continue?

2•ainthusiast•20m ago•1 comments

About Homespring.cloud

https://homespring.cloud/about
1•mygrant•21m ago•1 comments

Three Time-to-Power Strategies That Failed in 2025

https://chrisgillett.org/three-failed-time-to-power-strategies
2•powermarketer•22m ago•0 comments

Sam Altman's World ID Expands Biometric Identity Checks

https://reclaimthenet.org/world-id-iris-scan-online-verification-expansion
3•uyzstvqs•24m ago•0 comments

Use Faker to improve the quality of your tests

https://howtotestfrontend.com/resources/why-you-should-use-faker
1•howToTestFE•26m ago•0 comments

Trying and Failing with Claude

https://www.uncorrelatedcontents.com/blog/trying-and-failing-with-claude
1•Uncorrelated•28m ago•0 comments

Review: How Africa Works

https://www.worksinprogress.news/p/review-how-africa-works
1•syracusian•29m ago•0 comments