IMO this is likely what you get from running the model correctly as-is (i.e. with the same weight and activation dtypes), so Together is not bad here.
Moonshot AI themselves and Groq likely use some sampler tricks to eliminate schema validation errors.
So really the only thing this shows is that Nebius, Chutes, and AtlasCloud could be running something else (for example, a further-quantized model). Or have bugs.
Anyway, Novita does significantly better than Together on the vendor verifier chart, so the low quality must be at least partially Together's fault.
TIL. Bit of an aha moment: I never understood until now how a big model can verify faster than it can generate.
Cool hack though, kudos. Wonder if they can make Groq or Cerebras do the same thing?
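For anyone else who just had the same aha moment, here's a toy sketch (purely my own illustration, nothing to do with any real inference stack): generating k tokens takes k sequential forward passes because each token depends on the previous one, but checking k drafted tokens takes a single forward pass, since every position gets scored in parallel.

    VOCAB = 50

    class ToyModel:
        """Stand-in for an expensive transformer; each forward() is one full pass."""
        def forward(self, ids):
            # fake next-token scores: after token t the model "wants" (t * 7 + 3) % VOCAB
            return [[1.0 if v == (t * 7 + 3) % VOCAB else 0.0 for v in range(VOCAB)]
                    for t in ids]

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    def generate(model, prefix, k):
        out = list(prefix)
        for _ in range(k):                                  # k dependent steps: inherently serial
            out.append(argmax(model.forward(out)[-1]))
        return out[len(prefix):]

    def verify(model, prefix, draft):
        logits = model.forward(list(prefix) + list(draft))  # ONE pass scores every position
        accepted = []
        for i, tok in enumerate(draft):
            if argmax(logits[len(prefix) + i - 1]) != tok:  # stop at first disagreement
                break
            accepted.append(tok)
        return accepted

    prefix = [1, 2, 3]
    draft = generate(ToyModel(), prefix, 4)                 # 4 forward passes of the big model
    print(verify(ToyModel(), prefix, draft))                # the same 4 tokens checked in 1 pass

In real speculative decoding the draft comes from a small, cheap model, so the big model mostly just does verify passes, which is where the speedup comes from (greedy verification shown here for simplicity).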
petesergeant•2h ago
And yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905
You'll see Groq averaging 1,086 tps vs Together doing 59 tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.
Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty well on it too, but still, the difference between a top provider and an also-ran is giant.
God I love OpenRouter.
senko•2h ago
What I don't understand is Groq reporting 200 tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...
OpenRouter's numbers look fishy.
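One way to settle it is to measure it yourself instead of trusting either dashboard. Rough sketch against OpenRouter's OpenAI-compatible endpoint; the provider-pinning fields and provider names are as I understand their routing docs today, so double-check them before relying on this:

    import os, time, requests

    def tps(model, provider):
        t0 = time.time()
        r = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={
                "model": model,
                "messages": [{"role": "user", "content": "Write ~500 words about SRAM."}],
                "max_tokens": 600,
                # pin a single upstream vendor; field names per OpenRouter's provider routing docs
                "provider": {"order": [provider], "allow_fallbacks": False},
            },
            timeout=120,
        )
        r.raise_for_status()
        completion_tokens = r.json()["usage"]["completion_tokens"]
        return completion_tokens / (time.time() - t0)

    for p in ["Groq", "Together"]:
        print(p, round(tps("moonshotai/kimi-k2-0905", p)), "tok/s")

Note this measures end-to-end throughput including time-to-first-token and network latency, so it will read lower than the pure decode speed the dashboards show, but the relative gap between providers should still be obvious.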
Havoc•2h ago
SambaNova should be similar... they take a similar specialized-hardware approach.
jsheard•2h ago
Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.
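Back-of-envelope on why it's expensive (the chip and model figures below are rough public numbers, treat them as assumptions): Kimi K2 is around 1T total parameters, and Groq's first-gen LPU has roughly 230 MB of on-chip SRAM.

    weights_bytes = 1e12 * 1        # ~1T params at 8 bits/weight, roughly 1 TB of weights
    sram_per_chip = 230e6           # ~230 MB SRAM per first-gen LPU (rough public figure)
    print(f"~{weights_bytes / sram_per_chip:,.0f} chips just to hold the weights")
    # -> on the order of 4,000+ chips before you even count KV cache or activations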
rfoo•1h ago
AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.
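Right, and the reason is the accept/reject rule itself: the draft token is kept with probability min(1, p/q) and otherwise resampled from the leftover distribution, which makes the final tokens distributed exactly like the big model's own samples. Minimal sketch of the textbook rule (Leviathan et al. / Chen et al. style), not necessarily what any particular provider ships:

    import random

    def accept_or_resample(p, q, drafted):
        """p: target-model probs, q: draft-model probs (lists over the vocab),
        drafted: the token index the draft model proposed (sampled from q)."""
        if random.random() < min(1.0, p[drafted] / q[drafted]):
            return drafted                                   # keep the draft token
        # rejected: resample from the normalized leftover distribution max(0, p - q)
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        z = sum(residual)
        return random.choices(range(len(p)), weights=[r / z for r in residual])[0]

The speedup depends entirely on how many draft tokens survive verification; a bad draft model only costs you speed, not output quality.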
meander_water•2h ago
[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance
KronisLV•49m ago
I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with on other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code
At the same time, their Qwen Coder 480B model is fine, but I still find myself going for Claude or GPT-5 or Gemini 2.5 Pro for more complex issues (or ones where I need good handling of the Latvian language), so at least for programming tasks it'd be super cool if they could eventually offer more models.
Or have some sort of partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limit occasionally.