I think raw inference speed will be a key differentiator in AI tool quality from a user's perspective, especially in use cases where model quality is already good enough for human-in-the-loop usage.
NOTE: On OpenRouter, Qwen3-Coder requests are currently averaging $0.30/1M input tokens and $1.20/1M output tokens. That's so much cheaper that I wouldn't be surprised if open-weight models start eating Google's, Anthropic's, and OpenAI's lunch. https://openrouter.ai/qwen/qwen3-coder
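For scale: assuming Anthropic's list price for Sonnet is still around $3/1M input and $15/1M output (my recollection, not verified against current pricing), that puts Qwen3-Coder at roughly 10x cheaper on input and over 12x cheaper on output.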
For code generation, this does seem pretty useful with something like Qwen3-Coder-480B, if it generates good enough code for your purposes.
But for chat, I wonder: does this kind of speed call for models that behave quite differently from current ones? At virtually instant speeds, I sometimes find myself wanting much shorter answers. Maybe a model designed and trained for concision, with a context spanning many, many turns, would be a uniquely useful option on this kind of hardware.
But I guess the hardware is really aimed at training, and the inference-as-a-service offering is basically a powerful form of marketing?
retreatguru•11h ago
I'd like to try this out: use Claude Code as the interface, set up claude-code-router to connect to Cerebras's Qwen3 Coder, and see a 20x speedup. The speed difference might make up for the slightly lower intelligence compared to Sonnet or Opus.
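If the routing works the way I understand it from the claude-code-router README, the setup is just a JSON config (~/.claude-code-router/config.json) pointing at Cerebras's OpenAI-compatible endpoint. A rough sketch; the endpoint path, model name, and exact schema here are my assumptions and may not match the current docs:

    {
      "Providers": [
        {
          "name": "cerebras",
          "api_base_url": "https://api.cerebras.ai/v1/chat/completions",
          "api_key": "YOUR_CEREBRAS_API_KEY",
          "models": ["qwen-3-coder-480b"]
        }
      ],
      "Router": {
        "default": "cerebras,qwen-3-coder-480b"
      }
    }

Then you'd run Claude Code through the router and requests should go to Cerebras instead of Anthropic.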
I don't see Qwen3 Coder available from Cerebras on OpenRouter yet: https://openrouter.ai/provider/cerebras