Author here. Following llama.cpp PR #19493 (speculative checkpointing) merge on 2026-04-19, I ran a 19-configuration matrix on a single RTX 3090 with Qwen3.6-35B-A3B (UD-Q4_K_XL, 21 GB on disk). None of ngram-cache, ngram-mod (including srogmann's recommended n=24 --draft-min 48 --draft-max 64), or classic --model-draft with the vocab-matched Qwen3.5-0.8B achieves a net speedup over the non-speculative baseline of 135.7 tok/s. Every draft-enabled config hits a bimodal tail of 59–67 tok/s on reasoning/code prompts despite 100% draft acceptance.
Controls ruled out: KV quantization (fp16 KV also regresses), output length (300 → 1000 tokens unchanged), and draft-model choice (Qwen3:0.6B has vocab 151936 and silently fails; Qwen3.5-0.8B matches vocab 248320 and loads correctly, still loses).
The pattern matches MoESD (arXiv 2505.19645) and Utility-Driven SD for MoE (arXiv 2506.20675). A3B has 3B active parameters out of ~35B total; with sparsity 0.031, the expert-saturation batch-size threshold is roughly 94 tokens. Draft K of 3–32 is well below that, so every drafted token pulls in a nearly fresh expert slice, and the batched verification pass pays to load the union of all their expert sets. 100% acceptance cannot rescue this: acceptance only buys you more tokens per pass, while the pass itself is already priced like K separate decodes.
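To make the threshold claim concrete, here is a minimal back-of-envelope sketch. It assumes (my assumptions, not from the post or the cited papers) that experts are activated roughly uniformly and independently per token, so a batch of B tokens touches an expected fraction 1 − (1 − s)^B of all experts, and it uses a hypothetical 95% coverage cutoff to define "saturation", which lands near the ~94-token figure above.

```python
import math

# Assumptions (mine): uniform, independent expert routing per token,
# and "saturation" = expected coverage of >= 95% of experts.
SPARSITY = 0.031      # active-parameter fraction reported for the A3B model
SATURATION = 0.95     # hypothetical coverage cutoff

def expert_coverage(batch: int, s: float = SPARSITY) -> float:
    """Expected fraction of distinct experts activated by `batch` tokens."""
    return 1.0 - (1.0 - s) ** batch

def saturation_threshold(s: float = SPARSITY, cut: float = SATURATION) -> int:
    """Smallest batch size whose expected expert coverage reaches `cut`."""
    return math.ceil(math.log(1.0 - cut) / math.log(1.0 - s))

if __name__ == "__main__":
    print(f"saturation threshold ~= {saturation_threshold()} tokens")
    for k in (3, 8, 16, 32):
        cov = expert_coverage(k + 1)   # K drafted tokens + 1 bonus token
        ratio = cov / expert_coverage(1)
        print(f"K={k:2d}: coverage {cov:5.1%}, ~{ratio:.1f}x the experts of one token")
```

With K well under 1/s ≈ 32, the union grows almost linearly in K, so verifying K drafted tokens costs close to K single-token steps in expert traffic: exactly the regime where perfect acceptance still yields no net speedup. Above the threshold, extra tokens mostly reuse already-resident experts, which is the regime srogmann's A10B numbers sit in.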
srogmann's own Qwen3.5-122B-A10B benchmark in PR #20075 shows +15–45% speedup, which is consistent with A10B sitting above the threshold. So the PR itself works as intended on A10B and up; the regression is specific to the small-active-parameter MoE class.
Raw per-request JSON, 3 matplotlib plots, aggregated CSV, BENCHMARK_ENV.md (driver, CUDA, commit, model SHA256), and the exact run_*_matrix.sh are in the repo. Happy to accept replications from other Ampere cards.