They're somehow connected to vision, and they block speculative decode... don't ask me how/why though
For Gemma specifically I had more luck with speculative decoding via the llama-server route than with LM Studio (rough example below).
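For reference, a draft-model launch on a recent llama.cpp build looks roughly like this. The GGUF filenames are placeholders, and exact flag names can vary between builds, so check `llama-server --help` for your version:

```sh
# Minimal sketch: serve a big main model with a small draft model for
# speculative decoding. Filenames below are placeholders.
#   -m          main model
#   -md         draft model
#   --draft-max / --draft-min   how many tokens to draft per step
#   -ngl / -ngld                GPU offload for main / draft model
llama-server \
  -m gemma-27b-it-Q4_K_M.gguf \
  -md gemma-1b-it-Q4_K_M.gguf \
  --draft-max 16 --draft-min 4 \
  -ngl 99 -ngld 99
```

The main thing that matters is that the draft model shares a tokenizer/vocab with the main one, otherwise the server will reject the pairing.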
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
The current implementation ignores that head, but the PR lets the tool recognize it, plus does proper integration (run the MTP head while running the slower main path, then compare the results, I believe).
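In case it helps, here's a minimal sketch of that draft-then-verify loop in Python. The `draft()` and `verify()` helpers are hypothetical stand-ins for the MTP head and the main model's forward pass (I haven't read the PR's actual API), and this shows greedy acceptance only; real implementations also handle sampling:

```python
# Hypothetical sketch of speculative decoding with an MTP-style draft head.
# draft_head.draft() and main_model.verify() are assumed interfaces, not
# the PR's real API.

def speculative_step(main_model, draft_head, context, n_draft=4):
    # 1. Cheaply propose n_draft tokens with the MTP/draft head.
    proposed = draft_head.draft(context, n_tokens=n_draft)

    # 2. Run the slow main model once over context + proposal; it returns
    #    the token it *would* have generated at each drafted position.
    expected = main_model.verify(context, proposed)

    # 3. Accept the longest prefix where draft and main model agree.
    accepted = []
    for p, e in zip(proposed, expected):
        if p != e:
            accepted.append(e)  # at the first mismatch, keep the main model's token
            break
        accepted.append(p)
    return accepted  # always >= 1 token per main-model pass
```

The win comes from step 2 verifying all the drafted tokens in a single batched pass, so when the draft is right you get several tokens for one main-model forward.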
Some of the work in that direction, like what Cerebras or Taalas have been doing, is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder what might be possible if even current state-of-the-art models were available at, like, a million tokens per second at very low cost.
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
I first tried with Qwen, but it was unstable and had ridiculously long thinking traces!
They serve gemma-4-26b-a4b-it.
mchusma•1h ago
Farmadupe•58m ago
Might be easier to chuck it over the fence and let other providers handle it, since it'll run on almost any commercial-grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash Lite, depending on serving cost and quality of outputs?
As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27B and 35B) are not at all worth using, as there are way bigger and better models available for a lower price on a per-token basis.
disiplus•46m ago
dakolli•40m ago
disiplus•17m ago
Havoc•26m ago
https://www.youtube.com/watch?v=sXgZhGzqPmU
As for why they offer it in the cloud: I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.