I'm not a C++ dev, I know almost nothing about ML, and I'm only scratching the surface of how LLMs work. What started as curiosity turned into a two-week rabbit hole of experiments and ended with a PoC that fully recovers the prompt-processing (PP) overhead of MTP on GPU, far beyond anything I expected.
TL;DR: instead of running the last layer's MoE FFN for every token in the ubatch (usually 512-2048 tokens), this PoC runs it only for the output row (usually 1 token during prefill). As a result, PP TPS is back to the same level as with MTP disabled. In my bench that was a 20% uplift, while keeping most of MTP's benefit to TG TPS, even with a slight drop in draft acceptance rate in one of the benchmarks. More details in the branch readme: https://codeberg.org/rocoe/llama.cpp/src/branch/masked-nextn-skip-catchup/README.md
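To make the idea concrete, here is a toy sketch in plain Python. It is not the code from the branch and it doesn't use any llama.cpp API: the function names, the tiny dimensions and the ReLU experts are all made up for illustration. It only relies on the fact that an FFN works on each token row independently, so computing just the row you need gives exactly the same values for that row as computing the whole ubatch.

```python
import math
import random

def moe_ffn_row(x, experts, router, top_k=2):
    """Toy MoE FFN for ONE token row (hypothetical, for illustration only)."""
    # Router: one score per expert, then keep the top_k experts.
    scores = [sum(w * xi for w, xi in zip(r, x)) for r in router]
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
    # Softmax over the selected experts only.
    m = max(scores[e] for e in top)
    weights = [math.exp(scores[e] - m) for e in top]
    z = sum(weights)
    out = [0.0] * len(x)
    for e, w in zip(top, weights):
        # Each toy expert is a single d x d matrix followed by ReLU.
        y = [max(0.0, sum(wij * xj for wij, xj in zip(row, x))) for row in experts[e]]
        for i in range(len(out)):
            out[i] += (w / z) * y[i]
    return out

def ffn_full(ubatch, experts, router):
    """Baseline: run the FFN for every row of the ubatch."""
    return [moe_ffn_row(x, experts, router) for x in ubatch]

def ffn_output_rows_only(ubatch, output_rows, experts, router):
    """Masked variant: run the FFN only for the rows whose output is needed."""
    return {i: moe_ffn_row(ubatch[i], experts, router) for i in output_rows}

random.seed(0)
d, n_experts, n_tokens = 8, 4, 512  # made-up sizes
router = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n_experts)]
experts = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
           for _ in range(n_experts)]
ubatch = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n_tokens)]

last = n_tokens - 1  # during prefill, usually only the last token needs output
full = ffn_full(ubatch, experts, router)
masked = ffn_output_rows_only(ubatch, [last], experts, router)

print(len(full), len(masked))      # rows computed: 512 vs 1
print(masked[last] == full[last])  # True: same values for the output row
```

In this toy the saving is simply 512 FFN evaluations versus 1. How that maps onto the real graph, and what else has to stay consistent for the following tokens, is covered in the readme.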
I worked with GLM 5.1 to write the code, Minimax M3 ran the tests and benchmarks on Modal, and GLM 5.2 reviewed the work. GLM 5.1 is very smart and GLM 5.2 is capable of spotting deep side effects in the code, so it's no surprise it's at the top. The Minimax M2.x models were fast but lazy. M3 is a real leap and deserves more attention: it is smart, proactive, follows instructions and corrects itself.
I'm not opening a PR to llama.cpp because this is AI-generated code, which goes against their contribution policy (a policy I support). If you know llama.cpp internals, you're invited to take a look at the PoC. I'll be happy to work alongside you on a more mature implementation that can be opened as a PR. This work is released under the MIT license, same as llama.cpp.
Happy to answer questions in the comments.