Accelerating Gemma 4: faster inference with multi-token prediction drafters

https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
99•amrrs•1h ago

Comments

mchusma•1h ago
I find it puzzling that Google doesn't actively promote its own cloud for inference of Gemma 4. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?
Farmadupe•58m ago
I wonder if, for a model this small with a permissive license, it might not be worth their time to host a commercial-grade inference stack?

Might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card?

Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs?

As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27B and 35B) are not at all worth using, as there are far bigger and better models for a lower price on a per-token basis.

disiplus•46m ago
I don't know what you're talking about; I replaced an older GPT-4o with a fine-tuned Qwen. There is a huge amount of "AI" that can be done with those models, or partly by those models. A huge number of people would not notice the difference, and if you prepare the context correctly, an even bigger slice of people would not notice.
dakolli•40m ago
Genuinely curious: what are you fine-tuning these smaller models to do reliably? I hear this talked about a lot, but very few people actually cough up examples, and I'd love to hear of one.
disiplus•17m ago
Depends. A super small one is fine-tuned to do function calling instead of sending the request to a big model and waiting: you ask for last month's revenue, I do a small-LLM function call -> show results. Some bigger ones handle analysis, summarization, and classification. What is great with the smaller ones (I'm looking at 2B, 4B) is that you can get huge throughput with just vLLM and a couple of consumer GPUs. What I usually do is basically distillation of a big model onto a smaller one, roughly as sketched below.
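
(A minimal sketch of that distillation loop, with everything hedged: the teacher is assumed to sit behind an OpenAI-compatible endpoint such as the one vLLM serves, the student is trained with TRL's SFTTrainer, and the teacher name and prompts here are placeholders, not my actual setup.)

    # Step 1: collect the big model's answers to real prompts.
    from datasets import Dataset
    from openai import OpenAI
    from trl import SFTConfig, SFTTrainer

    teacher = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    prompts = [
        "What was our revenue last month?",
        "Classify this ticket: 'Refund not received after 14 days.'",
    ]  # in practice: thousands of prompts sampled from real traffic

    def teacher_answer(prompt):
        resp = teacher.chat.completions.create(
            model="the-big-model",  # placeholder teacher name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    rows = [
        {"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": teacher_answer(p)},
        ]}
        for p in prompts
    ]

    # Step 2: fine-tune a small student on the teacher's outputs.
    # Recent TRL applies the chat template to "messages" automatically.
    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-3B-Instruct",  # example 3B student
        train_dataset=Dataset.from_list(rows),
        args=SFTConfig(output_dir="distilled-student", num_train_epochs=2),
    )
    trainer.train()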
Havoc•26m ago
There is a decent YouTube video here going through what Google's logic with Gemma overall might be:

https://www.youtube.com/watch?v=sXgZhGzqPmU

As for why their cloud does offer it - I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.

these•59m ago
Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to let me enable it.
dvt•48m ago
It's not implemented in mlx[1] yet (or llama.cpp[2]), so it may take a while.

[1] https://github.com/ml-explore/mlx-lm/pull/990

[2] https://github.com/ggml-org/llama.cpp/pull/22673

svachalek•39m ago
I've gotten it to work with other models. They usually have to be perfectly aligned in terms of provider, quantization, etc. Might be a bit before you can get a matched set.
Havoc•29m ago
Normally when LM Studio doesn't like it, it's because of the presence of mmproj files in the folder. Sometimes removing them helps it show up.

They're somehow connected to vision and they block speculative decoding... don't ask me how or why, though.

For Gemma specifically, I had more luck with speculative decoding via the llama-server route than via LM Studio.
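
(For reference, the llama-server route looks roughly like this. A sketch with placeholder GGUF paths; the draft flags exist in recent llama.cpp builds, but check llama-server --help on yours.)

    llama-server -m gemma-4-26b-q4_k_m.gguf \
        --model-draft gemma-4-drafter-q8_0.gguf \
        --draft-max 8 --draft-min 1 \
        -ngl 99 -ngld 99   # offload both main and draft model to the GPU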

AlphaSite•25m ago
Yes. Make sure you're not using the Gemma sparse models, since they don't have a small model to use. Also, I removed all the image models from the workspace.
disiplus•54m ago
Nice, will run it later against Qwen3.6 27B. The speed was one of the reasons I was running Qwen and not Gemma, and the difference was big. There is some magic that happens when you get more than 100 tps.
zdw•48m ago
MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon.

The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.

dakolli•44m ago
Yet, still mostly useless.
EGreg•32m ago
How does this get added in practice?
flakiness•27m ago
According to the linked PR, the original model does come with MTP, which is another "head" (i.e., output path) in the same model that (supposedly) runs faster.

The current implementation ignores that head, but the PR lets the tool recognize it and does the proper integration (run the MTP head while running the slower main path, then compare the results, I believe).
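
(In toy Python, that draft-and-verify loop looks roughly like the sketch below. This is the generic speculative-decoding recipe, not the PR's actual code; the two functions are fakes standing in for the MTP head and the main path.)

    import random
    random.seed(0)

    def main_forward(tokens):
        # Fake main model: the "correct" token after value t is (t + 1) % 100.
        # Returns one next-token prediction per position, all from a single
        # parallel pass.
        return [(t + 1) % 100 for t in tokens]

    def draft_next_tokens(context, k):
        # Fake MTP head: proposes k tokens cheaply, right ~75% of the time.
        out, last = [], context[-1]
        for _ in range(k):
            last = (last + 1) % 100 if random.random() < 0.75 else 0
            out.append(last)
        return out

    def speculative_step(context, k=4):
        draft = draft_next_tokens(context, k)        # cheap proposal
        predicted = main_forward(context + draft)    # one main-model pass
        accepted = []
        for i, tok in enumerate(draft):
            want = predicted[len(context) + i - 1]   # main model's choice here
            if tok != want:
                accepted.append(want)                # fix first mismatch, stop
                break
            accepted.append(tok)                     # draft token verified
        else:
            accepted.append(predicted[-1])           # all matched: free extra
        return context + accepted

    seq = [1]
    while len(seq) < 16:
        seq = speculative_step(seq)
    print(seq)  # identical to plain decoding, but in far fewer main passes

That's also why there's no quality loss: every committed token is one the main model itself predicted.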

WhitneyLand•29m ago
Yeah, it's important conceptually to remember that MTP is kind of just more weights, while speculative decoding is the runtime algorithm, which is a significant addition to whatever code is serving the model.
HumanOstrich•13m ago
That is... inaccurate.
tarruda•17m ago
There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673
skybrian•40m ago
Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
macNchz•17m ago
This is something I've been thinking about for a while... the current state of things really does feel kind of like the dial-up era, and I wonder what the "broadband" era could look like. Watching tokens stream in is reminiscent of watching a JPEG load a few rows of pixels at a time, and of the various loading and connecting animations that applications implemented before things got fast enough to make them less relevant.

Some of the work in that direction, like what Cerebras or Taalas have been doing, is an interesting glimpse of what might be possible. In the meantime, it's a fun thought experiment to wonder what might be possible if even current state-of-the-art models were available at, say, a million tokens per second at very low cost.

shay_ker•30m ago
Curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron:

https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...

zargon•24m ago
They're using the term speculative decoding but doing MTP. It's the same thing as Nemotron, but Google removed the MTP heads from the original safetensors release. (They were not removed from the LiteRT format.)
nalinidash•28m ago
technical details are here: https://x.com/googlegemma/status/2051694045869879749
pu_pe•22m ago
So much faster inference with no quality degradation? All that for just some small memory overhead (drafter models are <1B it seems)?
tarruda•9m ago
They also published draft models for E4B and E2B. For those, the draft models are only 78M parameters: https://huggingface.co/google/gemma-4-E4B-it-assistant
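
(If serving stacks wire these up the way they did earlier drafters, usage might look like the sketch below. This assumes vLLM's current speculative_config API and that this exact model pairing is supported; the target checkpoint name is a guess.)

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="google/gemma-4-E4B-it",  # assumed target checkpoint name
        speculative_config={
            "model": "google/gemma-4-E4B-it-assistant",  # the 78M drafter
            "num_speculative_tokens": 5,
        },
    )
    out = llm.generate(
        ["Explain speculative decoding in one paragraph."],
        SamplingParams(max_tokens=128),
    )
    print(out[0].outputs[0].text)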
christina97•16m ago
I recently set up the 26B A4B model on vLLM on an RTX 3090 (4-bit) after a hiatus from local models. Just completely blown away by the speed and quality you can get now for a sub-$1k investment.

I tried Qwen first, but it was unstable and had ridiculously long thinking traces!

jszymborski•14m ago
The A4B model is blazing fast and super good at general inquiries. It's notably worse than Qwen 3.6 for coding tasks, but that says more about the Qwen model.
deskamess•13m ago
Did DeepSeek come up with MTP? It was listed prominently in their recent paper as being carried forward from the previous release.
brcmthrowaway•13m ago
Is Google's local-model strategy aimed at taking the big AI cloud labs down a notch?
m3kw9•12m ago
OK, so? Anyone got a verdict/review?
recsv-heredoc•10m ago
Cloudflare offers excellent service for many of the open-weights models. It's fast, cheap, and simple to set up. I can highly recommend them as an LLM provider.

They serve gemma-4-26b-a4b-it.

Oil 101, Second Edition

https://oil101.morgandowney.com
1•mxschumacher•31s ago•0 comments

An Open Letter to Jay Bhattacharya

https://www.science.org/content/blog-post/open-letter-jay-bhattacharya
2•jeromechoo•1m ago•0 comments

Show HN: I built a spoiler-free WWE dashboard for 2001-2019 with 15,000 matches

https://warner-wvez.github.io/wrestling-dashboard/
1•wvez22•2m ago•0 comments

PostHog Code

https://posthog.com/code
4•bewal416•2m ago•0 comments

Nostr Mail – Nostr Mail Documentation

https://nogringo.github.io/nostr-mail/#what-is-nostr-mail
2•janandonly•3m ago•0 comments

Spaces Protocol May 2026 Update

https://spacesprotocol.org/blog/may-2026-update/
1•ca98am79•3m ago•0 comments

Orbee chat: your name, your people, your rules

https://orbee.chat/
1•ca98am79•4m ago•0 comments

Changes in Hospital Finance, Operations and Quality After Management Consultants

https://jamanetwork.com/journals/jama/article-abstract/2848641
1•randycupertino•4m ago•1 comments

DigitalOcean's NYC region looked fine – until we ran it again

https://webbynode.com/articles/digitalocean-nyc1-performance-drops-over-time
3•gsgreen•6m ago•0 comments

Understand EOB and medical bill text locally in Chrome

https://chromewebstore.google.com/detail/keepmd-eob-decoder/dojjljfafpojmbhjljnkpglmahhglbco
1•teddyX•6m ago•1 comments

OpenAI smartphone leak reveals next-gen chipset and more details

https://www.notebookcheck.net/OpenAI-smartphone-leak-reveals-next-gen-chipset-and-more-details.12...
1•thunderbong•6m ago•0 comments

Detecting silent LLM agent degradation before users do

https://www.ainative.builders/platform/silent-agent-degradation-detection
2•v1b3•6m ago•1 comments

UALink AI Accelerator Spec Maintains Rapid Update Pace

https://www.eetimes.com/ai-accelerator-spec-maintains-rapid-update-pace/
1•mindcrime•8m ago•0 comments

The exotic particles that could break the Standard Model

https://www.nature.com/articles/d41586-026-01387-x
2•digital55•8m ago•0 comments

Quantum Key Distribution (QKD) and Quantum Cryptography (QC)

https://www.nsa.gov/Cybersecurity/Quantum-Key-Distribution-QKD-and-Quantum-Cryptography-QC/
4•mooreds•10m ago•0 comments

Teeny-Tiny Notes

https://khoaly.xyz/teeny-tiny-notes/
1•speckx•10m ago•0 comments

National space weather center on chopping block

https://www.nytimes.com/2026/03/13/climate/ncar-breakup-plan-nasa-noaa.html
1•eliascanetti•13m ago•0 comments

David Attenborough, 'the voice for nature,' turns 100

https://www.reuters.com/world/uk/david-attenborough-the-voice-nature-turns-100-2026-05-05/
1•jmsflknr•13m ago•0 comments

Dreamer: Make any coding agent self-evolving, across the whole team

https://github.com/luml-ai/dreamer
2•iryna_kondr•14m ago•1 comments

The Other Twin Towers in the Spider-Man Trailer

https://ironicsans.ghost.io/the-other-twin-towers-in-the-spider-man-trailer/
2•caminanteblanco•15m ago•0 comments

CBOMkit: Explore the Use of Cryptography in Software

https://www.zurich.ibm.com/cbom/
2•mooreds•16m ago•0 comments

Tokens and Dreams

https://charlesleifer.com/blog/tokens-and-dreams/
2•cleifer•16m ago•0 comments

Curious cases of financial engineering in biotech

https://www.owlposting.com/p/curious-cases-of-financial-engineering
1•abhishaike•16m ago•0 comments

Cross-target schema drift in Cal.com: 1 finding in 1096 fields

https://github.com/wiaahmarketplace/typerion-oss/tree/main/examples/case-studies/calcom
1•Techman92•17m ago•0 comments

Congress Is Doing Little to Prepare for Potential A.I. Job Losses

https://www.nytimes.com/2026/05/05/business/artificial-intelligence-safety-net.html
2•cdrnsf•19m ago•2 comments

Eight vaccines linked to a lower risk of dementia

https://www.gavi.org/vaccineswork/eight-vaccines-linked-lower-risk-dementia
5•ivankra•19m ago•0 comments

IBM didn't want Microsoft to use the Tab key to move between dialog fields

https://devblogs.microsoft.com/oldnewthing/20260505-00/?p=112298
32•SeenNotHeard•20m ago•11 comments

Wearables Are Going Off the Rails

https://gizmodo.com/wearables-are-going-fully-off-the-rails-2000754560
2•ulrischa•20m ago•0 comments

Humane AI Pin hacks turn the discontinued gadget into a standalone Android-powered device

https://liliputing.com/humane-ai-pin-hacks-turn-the-discontinued-gadget-into-a-standalone-android...
1•speckx•21m ago•0 comments

Flattery jailbreaks Claude into giving bomb-making instructions

https://www.theverge.com/ai-artificial-intelligence/923961/security-researchers-mindgard-gaslit-c...
1•AgentNews•21m ago•0 comments