They're somehow connected to vision, and they block speculative decode... don't ask me how/why though
For Gemma specifically I had more luck with speculative decoding via the llama-server route than with LM Studio (rough example below).
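For reference, a draft-model launch on a recent llama.cpp build looks roughly like this. The GGUF filenames are placeholders, and exact flag names can vary between builds, so check `llama-server --help` for your version:

```sh
# Minimal sketch: serve a big main model with a small draft model for
# speculative decoding. Filenames below are placeholders.
#   -m          main model
#   -md         draft model
#   --draft-max / --draft-min   how many tokens to draft per step
#   -ngl / -ngld                GPU offload for main / draft model
llama-server \
  -m gemma-27b-it-Q4_K_M.gguf \
  -md gemma-1b-it-Q4_K_M.gguf \
  --draft-max 16 --draft-min 4 \
  -ngl 99 -ngld 99
```

The main thing that matters is that the draft model shares a tokenizer/vocab with the main one, otherwise the server will reject the pairing.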
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.
The current implementation ignores that head, but the PR lets the tool recognize it, plus does proper integration (run the MTP head while running the slower main path, then compare the results, I believe).
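In case it helps, here's a minimal sketch of that draft-then-verify loop in Python. The `draft()` and `verify()` helpers are hypothetical stand-ins for the MTP head and the main model's forward pass (I haven't read the PR's actual API), and this shows greedy acceptance only; real implementations also handle sampling:

```python
# Hypothetical sketch of speculative decoding with an MTP-style draft head.
# draft_head.draft() and main_model.verify() are assumed interfaces, not
# the PR's real API.

def speculative_step(main_model, draft_head, context, n_draft=4):
    # 1. Cheaply propose n_draft tokens with the MTP/draft head.
    proposed = draft_head.draft(context, n_tokens=n_draft)

    # 2. Run the slow main model once over context + proposal; it returns
    #    the token it *would* have generated at each drafted position.
    expected = main_model.verify(context, proposed)

    # 3. Accept the longest prefix where draft and main model agree.
    accepted = []
    for p, e in zip(proposed, expected):
        if p != e:
            accepted.append(e)  # at the first mismatch, keep the main model's token
            break
        accepted.append(p)
    return accepted  # always >= 1 token per main-model pass
```

The win comes from step 2 verifying all the drafted tokens in a single batched pass, so when the draft is right you get several tokens for one main-model forward.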
Some of the work in that direction, like what Cerebras or Taalas have been doing, is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder what might be possible if even current state-of-the-art models were available at, like, a million tokens per second at very low cost.
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
I first tried with Qwen, but it was unstable and had ridiculously long thinking traces!
They serve gemma-4-26b-a4b-it.
mchusma•1h ago
Farmadupe•58m ago
Might be easier to chuck it over the fence and let other providers handle it, since it'll run on almost any commercial-grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash Lite, depending on serving cost and quality of outputs?
As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27B and 35B) are not at all worth using, as there are way bigger and better models available for a lower price on a per-token basis.
disiplus•46m ago
dakolli•40m ago
disiplus•17m ago
Havoc•26m ago
https://www.youtube.com/watch?v=sXgZhGzqPmU
As for why they offer it in the cloud: I think it's just an effort to promote the brand. The Gemmas are pretty small, so they can host them without it being a major drain on the company. They have the infra anyway.