They're somehow connected to vision and block speculative decode... don't ask me how or why, though.
For Gemma specifically, I had more luck with speculative decoding via the llama-server route than LM Studio.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
The current implementation ignores that head, but the PR lets the tool recognize it and does proper integration (running the MTP head alongside the slower main path and then comparing the results, I believe).
https://github.com/ollama/ollama/pull/15980
Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?
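Purely as a thought-experiment sketch of that flow (the op format and apply step below are made up for illustration, not any particular tool's API), the model would emit a handful of edit operations and the client would apply them locally, so the unchanged text never has to be regenerated:

    # Hypothetical edit-op format: (start, end, replacement), offsets into the
    # ORIGINAL text. A model would emit a few of these instead of re-emitting
    # the whole document.
    def apply_ops(text: str, ops: list[tuple[int, int, str]]) -> str:
        # Apply right-to-left so earlier offsets stay valid after each splice.
        for start, end, replacement in sorted(ops, reverse=True):
            text = text[:start] + replacement + text[end:]
        return text

    original = "The quick brown fox jumps over the lazy dog."
    ops = [(4, 9, "sly"), (35, 39, "sleeping")]  # "quick" -> "sly", "lazy" -> "sleeping"
    print(apply_ops(original, ops))
    # -> The sly brown fox jumps over the sleeping dog.

In practice the hard part is getting a model to emit reliable character offsets; some coding tools get a similar effect by asking for search/replace blocks instead of whole-file rewrites.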
Some of the work in that direction, like what Cerebras or Taalas have been doing, is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder what might be possible if even current state-of-the-art models were available at, say, a million tokens per second at very low cost.
Modem vs Claude, according to Claude (rough math sketched after the list):
300 @ 2368 characters - 1m 19s
1200 @ 2368 characters - 19.7s
2400 @ 2368 characters - 9.9s
14.4K @ 2368 characters - 1.6s
33.6K @ 2368 characters - 705 ms
56K @ 2368 characters - 447 ms
Claude @ 2368 characters - 7.9s
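For what it's worth, those modem figures look like plain 10-bits-per-character (8N1) serial math; a quick check of my own (not from the thread):

    chars = 2368
    for baud in (300, 1200, 2400, 14_400, 33_600, 56_000):
        seconds = chars * 10 / baud  # ~10 bits per character with 8N1 framing
        print(f"{baud:>6} baud: {seconds:7.2f} s")
    # 300 baud -> ~79 s, 2400 -> ~9.9 s, 56K -> ~0.42 s: roughly the figures above.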
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
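A toy sketch of the draft-then-verify loop being described here and in the Ollama PR above (both "models" are stand-in stubs, not Ollama's or Megatron's actual APIs): the cheap draft proposes a few tokens, the expensive target checks them in what would be a single batched pass, and only the agreeing prefix is kept:

    import random

    VOCAB = list("abcdefgh")

    def target_next(context):
        # Stand-in for the big model's greedy next token (deterministic per context).
        return VOCAB[hash(tuple(context)) % len(VOCAB)]

    def draft_next(context):
        # Stand-in for the cheap draft/MTP head: agrees with the target most of the time.
        return target_next(context) if random.random() < 0.8 else random.choice(VOCAB)

    def speculative_step(context, k=4):
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(context)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Target verifies all k positions (one batched forward pass in practice).
        accepted, ctx = [], list(context)
        for tok in proposal:
            expected = target_next(ctx)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, stop
                break
            accepted.append(tok)
            ctx.append(tok)
        return accepted  # always >= 1 token per expensive pass, often several

    out = list("abc")
    for _ in range(5):
        out += speculative_step(out)
    print("".join(out))

The KV-cache sharing in the quote is the extra trick on top of this: because the draft head reads the target's own activations, the proposal step doesn't have to recompute the context at all.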
I tried first with Qwen, but it was unstable and had ridiculously long thinking traces!
They serve gemma-4-26b-a4b-it.
Gemma 4 26B-A4B is much quicker on my setup than Qwen3.6-35B-A3B (by about 3x), so the thought of a further 1.5x speedup is tantalizing.
I've tried draft models with limited success (even a smaller 3B draft model alongside a dense 14B Ministral model already introduced too much overhead).
For Gemma 4 26B at the same quantization, I get >200 TPS.
Also note that Qwen is extremely inefficient at reasoning; its reasoning chains are ~3x longer than Gemma's on average.
However, it is a little painful to try to fit the best possible version into 24 GB of VRAM once vision and this drafter are added. My build doesn't support any more GPUs, and I believe I'd want another 4090 (overpriced) for best performance, or otherwise I'd just replace it altogether.
google/gemma-4-31B-it-assistant
google/gemma-4-26B-A4B-it-assistant
google/gemma-4-E4B-it-assistant
google/gemma-4-E2B-it-assistant
Gemma:31b was more accurate but speed was horrendous.
You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
It's not uncommon to see a Gemma vs Qwen comparison where Qwen does a bit better but spent 22 minutes on the task, while Gemma aligned the buttons wrong but only spent 4 minutes on the same prompt. So, taken at face value, Gemma is now underperforming leading open models by 5-10%, but doing it in 1/10th the time.
Caveat: Gemini has been dumbed down a few times over the last year. Rate limits tightened up too. So it might not be this good in the future.
Farmadupe•1h ago
Might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs?
As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27b and 35b) are not at all worth using, as there are way bigger and better models for a lower price on a per-token basis.
Farmadupe•8m ago
And even Alibaba's own qwen3.6-plus is $1.95, so it's kinda easy to conclude that neither Alibaba nor anyone else is really interested in hosting that model.
And don't get me wrong, I fully agree with you: qwen3.6 27b is an amazing model. I run it on my own hardware and every day I'm surprised by what it can zero-shot.
Havoc•1h ago
https://www.youtube.com/watch?v=sXgZhGzqPmU
As for why they offer it in the cloud - I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.
WarmWash•19m ago
If Gemma 4 is less lucrative than Claude to the Google Cloud kingdom, the Cloud kingdom will want you using Claude.