They're somehow connected to vision and block speculative decode... don't ask me how or why, though.
For Gemma specifically, I had more luck with speculative decoding via the llama-server route than LM Studio.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
The current implementation ignores that head, but the PR lets the tool recognize it and does proper integration (running the MTP head alongside the slower main path and then comparing the results, I believe).
https://github.com/ollama/ollama/pull/15980
Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?
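Purely as a thought-experiment sketch of that flow (the op format and apply step below are made up for illustration, not any particular tool's API), the model would emit a handful of edit operations and the client would apply them locally, so the unchanged text never has to be regenerated:

    # Hypothetical edit-op format: (start, end, replacement), offsets into the
    # ORIGINAL text. A model would emit a few of these instead of re-emitting
    # the whole document.
    def apply_ops(text: str, ops: list[tuple[int, int, str]]) -> str:
        # Apply right-to-left so earlier offsets stay valid after each splice.
        for start, end, replacement in sorted(ops, reverse=True):
            text = text[:start] + replacement + text[end:]
        return text

    original = "The quick brown fox jumps over the lazy dog."
    ops = [(4, 9, "sly"), (35, 39, "sleeping")]  # "quick" -> "sly", "lazy" -> "sleeping"
    print(apply_ops(original, ops))
    # -> The sly brown fox jumps over the sleeping dog.

In practice the hard part is getting a model to emit reliable character offsets; some coding tools get a similar effect by asking for search/replace blocks instead of whole-file rewrites.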
Some of the work in that direction, like what Cerebras or Taalas have been doing, is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder what might be possible if even current state-of-the-art models were available at, say, a million tokens per second at very low cost.
Modem vs Claude, according to Claude (rough math sketched after the list):
300 @ 2368 characters - 1m 19s
1200 @ 2368 characters - 19.7s
2400 @ 2368 characters - 9.9s
14.4K @ 2368 characters - 1.6s
33.6K @ 2368 characters - 705 ms
56K @ 2368 characters - 447 ms
Claude @ 2368 characters - 7.9s
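For what it's worth, those modem figures look like plain 10-bits-per-character (8N1) serial math; a quick check of my own (not from the thread):

    chars = 2368
    for baud in (300, 1200, 2400, 14_400, 33_600, 56_000):
        seconds = chars * 10 / baud  # ~10 bits per character with 8N1 framing
        print(f"{baud:>6} baud: {seconds:7.2f} s")
    # 300 baud -> ~79 s, 2400 -> ~9.9 s, 56K -> ~0.42 s: roughly the figures above.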
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.
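A toy sketch of the draft-then-verify loop being described here and in the Ollama PR above (both "models" are stand-in stubs, not Ollama's or Megatron's actual APIs): the cheap draft proposes a few tokens, the expensive target checks them in what would be a single batched pass, and only the agreeing prefix is kept:

    import random

    VOCAB = list("abcdefgh")

    def target_next(context):
        # Stand-in for the big model's greedy next token (deterministic per context).
        return VOCAB[hash(tuple(context)) % len(VOCAB)]

    def draft_next(context):
        # Stand-in for the cheap draft/MTP head: agrees with the target most of the time.
        return target_next(context) if random.random() < 0.8 else random.choice(VOCAB)

    def speculative_step(context, k=4):
        # 1. Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(context)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. Target verifies all k positions (one batched forward pass in practice).
        accepted, ctx = [], list(context)
        for tok in proposal:
            expected = target_next(ctx)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token, stop
                break
            accepted.append(tok)
            ctx.append(tok)
        return accepted  # always >= 1 token per expensive pass, often several

    out = list("abc")
    for _ in range(5):
        out += speculative_step(out)
    print("".join(out))

The KV-cache sharing in the quote is the extra trick on top of this: because the draft head reads the target's own activations, the proposal step doesn't have to recompute the context at all.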
I tried first with Qwen, but it was unstable and had ridiculously long thinking traces!
They serve gemma-4-26b-a4b-it.
Gemma 4 26B-A4B is much quicker on my setup than Qwen3.6-35B-A3B (by about 3x), so the thought of a further 1.5x speedup is tantalizing.
I've tried draft models with limited success (even a smaller 3B draft model alongside a dense 14B Ministral model already introduced too much overhead).
For Gemma 4 26B at the same quantization, I get >200 TPS.
Also note that Qwen is extremely inefficient at reasoning; its reasoning chains are ~3x longer than Gemma's on average.
However, it is a little painful to try to fit the best possible version into 24 GB of VRAM once vision and this drafter are added. My build doesn't support any more GPUs, and I believe I'd want another 4090 (overpriced) for best performance, or otherwise I'd just replace it altogether.
google/gemma-4-31B-it-assistant
google/gemma-4-26B-A4B-it-assistant
google/gemma-4-E4B-it-assistant
google/gemma-4-E2B-it-assistant
Gemma:31b was more accurate but speed was horrendous.
You can try it out with Ollama 0.23.1 by running `ollama run gemma4:31b-coding-mtp-bf16`.
It's not uncommon to see a Gemma vs Qwen comparison where Qwen does a bit better but spent 22 minutes on the task, while Gemma aligned the buttons wrong but only spent 4 minutes on the same prompt. So, taken at face value, Gemma is now underperforming leading open models by 5-10%, but doing it in 1/10th the time.
Caveat: Gemini has been dumbed down a few times over the last year. Rate limits tightened up too. So it might not be this good in the future.
Farmadupe•1h ago
Might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs?
As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27b and 35b) are not at all worth using, as there are way bigger and better models for a lower price on a per-token basis.
Farmadupe•8m ago
And even Alibaba's own qwen3.6-plus is $1.95, so it's kinda easy to conclude that neither Alibaba nor anyone else is really interested in hosting that model.
And don't get me wrong, I fully agree with you: qwen3.6 27b is an amazing model. I run it on my own hardware and every day I'm surprised by what it can zero-shot.
Havoc•1h ago
https://www.youtube.com/watch?v=sXgZhGzqPmU
As for why they offer it in the cloud - I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.
WarmWash•19m ago
If Gemma 4 is less lucrative than Claude to the Google Cloud kingdom, the Cloud kingdom will want you using Claude.