Ask HN: I was curious why MTP affects PP TPS in llama.cpp. My PoC recovers it?

2•i_am_rocoe•11h ago
I've been running Qwen3.6-35B-A3B locally on llama.cpp and noticed that prompt processing (PP) throughput drops noticeably when MTP is enabled. I got nerd-sniped.

I'm not a C++ dev, I know almost nothing about ML, and I'm only scratching the surface of how LLMs work. What started as curiosity turned into a two-week rabbit hole of experiments and ended with a PoC that fully recovers the MTP PP overhead on GPU, beyond any expectation I had.

TL;DR: instead of running the last layer's MoE FFN over every token in the ubatch (usually 512-2048 tokens), this PoC runs it only for the output rows (usually 1 token during prefill). As a result, PP TPS is back to the same level as with MTP disabled, which in my benchmark was a 20% uplift, while keeping most of MTP's benefit to TG TPS, despite a slight drop in draft acceptance rate in one of the benchmarks. More details in the branch README: https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn-skip-catchup/README.md
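To make the idea concrete, here is a minimal sketch of why skipping non-output rows is safe for the rows that are kept: an FFN is applied to each token independently, so computing it only for the output row gives the same values for that row as computing it for the whole ubatch. This is not the PoC's code. It uses a toy dense FFN rather than a MoE FFN, and every name in it (`ffn_row`, `ffn_all_rows`, `ffn_output_rows_only`, the dimensions) is made up for illustration.

```python
import random


def ffn_row(x, w_up, w_down):
    """Toy dense FFN for one token: down(relu(up(x))). Not a real MoE FFN."""
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, unit))) for unit in w_up]
    return [sum(hi * wi for hi, wi in zip(hidden, unit)) for unit in w_down]


def ffn_all_rows(ubatch, w_up, w_down):
    """Baseline: run the FFN for every token in the ubatch."""
    return [ffn_row(x, w_up, w_down) for x in ubatch]


def ffn_output_rows_only(ubatch, output_rows, w_up, w_down):
    """Sketch of the shortcut: run the FFN only for rows whose output is needed."""
    return {i: ffn_row(ubatch[i], w_up, w_down) for i in output_rows}


random.seed(0)
d_model, d_ff, n_tokens = 8, 16, 512  # hypothetical toy sizes

w_up = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
ubatch = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n_tokens)]

# During prefill, usually only the last token of the ubatch needs an output.
output_rows = [n_tokens - 1]

full = ffn_all_rows(ubatch, w_up, w_down)
sparse = ffn_output_rows_only(ubatch, output_rows, w_up, w_down)

rows_saved = n_tokens - len(output_rows)
same_output = all(sparse[i] == full[i] for i in output_rows)
print(f"FFN rows computed: {len(output_rows)} instead of {n_tokens}")
print(f"Output row identical: {same_output}")
```

The sketch only shows that the kept rows are unchanged and how many FFN rows are skipped. It says nothing about how the skipped rows affect the MTP head, which is where the draft acceptance rate comes in; the branch README covers that.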

I worked with GLM 5.1 to write the code, Minimax M3 ran the tests and benchmarks on Modal and GLM 5.2 reviewed the work. GLM 5.1 is very smart and GLM 5.2 is capable of spotting deep side-effects in the code, no surprise it's at the top. Minimax M2.x were fast but lazy, M3 is a real leap and deserves more attention: it is smart, proactive, follows instructions and auto-corrects.

I'm not opening a PR to llama.cpp because this is AI-generated code, which goes against their contribution policy, which I support. If you know llama.cpp internals, you're invited to take a look at the PoC. I'll be happy to work alongside you to open a PR with a more mature implementation. This work is released under MIT, same as llama.cpp.

Happy to answer questions in the comments.

Comments

i_am_rocoe•11h ago
If you want to run the PoC locally to replicate:

Clone the masked-nextn-skip-catchup branch:

https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...

Run llama-server with at least --n-gpu-layers 99 --spec-type draft-mtp --spec-draft-n-max 2 --parallel 1 --no-cache-prompt.
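As a rough sketch, the flags above can be assembled into a full command line like this. The binary path and model path are placeholders, not values from this post, and `-m` is the usual llama.cpp model flag added here as an assumption; the remaining flags are copied from the line above.

```python
import shlex

# Hypothetical paths: adjust to where you built the branch and stored the GGUF.
SERVER_BIN = "./build/bin/llama-server"
MODEL_PATH = "path/to/model.gguf"

args = [
    SERVER_BIN,
    "-m", MODEL_PATH,            # assumption: standard llama.cpp model flag
    "--n-gpu-layers", "99",      # offload all layers to the GPU
    "--spec-type", "draft-mtp",  # MTP drafting, as listed above
    "--spec-draft-n-max", "2",
    "--parallel", "1",           # see Code review findings
    "--no-cache-prompt",         # see Known limitations
]

# Print a copy-pasteable shell command instead of launching the server.
command = shlex.join(args)
print(command)
```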

I used Qwen3.6-35B-A3B-UD-Q2_K_XL MTP:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF?show...

I tested on an L40S (using Modal).

About the parameters:

--parallel 1, see Code review findings:

https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...

--no-cache-prompt, see Known limitations:

https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...
