Ask HN: I was curious why MTP affects PP TPS in llama.cpp. My PoC recovers it?

2•i_am_rocoe•11h ago

I've been running Qwen3.6-35B-A3B locally on llama.cpp and noticed that prompt processing (PP) throughput drops noticeably when MTP (multi-token prediction) is enabled. I got nerd-sniped.

I'm not a C++ dev, I know almost nothing about ML, and I'm only scratching the surface of how LLMs work. What started as curiosity turned into a two-week rabbit hole of experiments and ended with a PoC that fully recovers the MTP PP overhead on GPU, well beyond anything I expected.

TL;DR: instead of running the last layer's MoE FFN over every token in the ubatch (usually 512-2048 tokens), this PoC runs it only for the output row (usually 1 token during prefill). As a result, PP TPS is back to the same level as with MTP disabled, which in my benchmark was a 20% uplift, while keeping most of MTP's benefit to TG TPS, despite a slight drop in draft acceptance rate in one of the benchmarks. More details in the branch readme: https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn-skip-catchup/README.md
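To make the TL;DR concrete, here is a toy sketch of the underlying idea, not the PoC's code. It assumes only that an FFN is applied to each token's hidden state independently, so computing it for a subset of rows gives the same values for those rows as computing it for the whole ubatch. A small dense FFN stands in for the MoE FFN, and every name in it (`ffn_row`, `ffn_all_rows`, `ffn_output_rows_only`) is hypothetical.

```python
# Toy illustration only: a dense, position-wise FFN stands in for the MoE FFN.
# All function names here are hypothetical and do not exist in llama.cpp.

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn_row(x, w_up, w_down):
    """Up-projection, ReLU, down-projection for one token's hidden state."""
    h = [max(0.0, v) for v in matvec(w_up, x)]
    return matvec(w_down, h)

def ffn_all_rows(hidden, w_up, w_down):
    """Baseline: run the FFN for every token in the ubatch."""
    return [ffn_row(x, w_up, w_down) for x in hidden]

def ffn_output_rows_only(hidden, w_up, w_down, output_rows):
    """Run the FFN only for the rows whose result is actually consumed."""
    return {i: ffn_row(hidden[i], w_up, w_down) for i in output_rows}

# A 4-token "ubatch" with hidden size 2 and FFN size 3.
hidden = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # 3 x 2
w_down = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]    # 2 x 3

full = ffn_all_rows(hidden, w_up, w_down)
last = len(hidden) - 1
masked = ffn_output_rows_only(hidden, w_up, w_down, [last])

# The output row is identical, but only 1 of 4 rows was computed.
assert masked[last] == full[last]
```

With a real ubatch of 512-2048 tokens and a single output row, the FFN work for that layer shrinks by roughly the same factor, which is where the recovered PP throughput would come from under this assumption.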

I worked with GLM 5.1 to write the code; Minimax M3 ran the tests and benchmarks on Modal, and GLM 5.2 reviewed the work. GLM 5.1 is very smart, and GLM 5.2 is capable of spotting deep side effects in the code; no surprise it's at the top. Minimax M2.x were fast but lazy; M3 is a real leap and deserves more attention: it is smart, proactive, follows instructions and auto-corrects.

I'm not opening a PR to llama.cpp because this is AI-generated code, which goes against their contribution policy, which I support. If you know llama.cpp internals, you're invited to take a look at the PoC. I'll be happy to work alongside you to open a PR with a more mature implementation. This work is released under MIT, same as llama.cpp.

Happy to answer questions in the comments.

Comments

i_am_rocoe•11h ago

If you want to run the PoC locally to replicate:

Clone the masked-nextn-skip-catchup branch:

https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...

Run llama-server with at least --n-gpu-layers 99 --spec-type draft-mtp --spec-draft-n-max 2 --parallel 1 --no-cache-prompt.

I used Qwen3.6-35B-A3B-UD-Q2_K_XL MTP:

https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF?show...

I tested on an L40S (using Modal).

About the parameters:

--parallel 1, see Code review findings:

https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...

--no-cache-prompt, see Known limitations:

https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn...
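For scripted runs, the flags listed above can be assembled into a single command line. This is a rough sketch under stated assumptions: the binary name `llama-server` being on the PATH and the model filename are placeholders, `-m` is the usual llama.cpp model flag, and the remaining flags are copied verbatim from the list above (some may exist only on the PoC branch). The helper name `build_server_args` is hypothetical.

```python
import shlex

def build_server_args(model_path, binary="llama-server"):
    """Assemble the llama-server argument list from the flags in this comment.

    `binary` and `model_path` are placeholders; adjust them to your build.
    """
    return [
        binary,
        "-m", model_path,
        "--n-gpu-layers", "99",
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", "2",
        "--parallel", "1",        # see "Code review findings" in the readme
        "--no-cache-prompt",      # see "Known limitations" in the readme
    ]

args = build_server_args("Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf")  # placeholder filename
command_line = shlex.join(args)
print(command_line)
# To actually launch the server: subprocess.run(args, check=True)
```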
