I get the feeling I'm being slightly propagandized in this comment.
But that's just me and my pessimistic sci-fi scenario.
I’d also reread the HN guidelines.
Claude is not cheap, so why is it far and away the most popular if it's not top 10 in performance?
Qwen3 235B ranks highest on these benchmarks among open models, but I have never met anyone who prefers its output over DeepSeek R1's. It's extremely wordy and often gets caught in thought loops.
My interpretation is that the models at the top of ArtificialAnalysis are the ones focusing most on public benchmarks in their training. Note I am not saying xAI is necessarily doing this nefariously; it could just be that they decided it's better bang for the buck to rely on public benchmarks than to focus on building their own evaluation systems.
But Grok is not very good compared to the Anthropic, OpenAI, or Google models despite ranking so highly on benchmarks.
For example, Google's inexplicable design decisions around libraries and APIs mean it's often worth the 5% premium to just use OpenRouter to access their models. In other cases it's about which models particular agents default to.
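Concretely, that workaround means pointing a stock OpenAI-compatible client at OpenRouter instead of wrestling with Google's own SDKs. A minimal sketch, assuming the openai Python package and an OPENROUTER_API_KEY env var; the Gemini model slug is illustrative:

    import os
    from openai import OpenAI

    # OpenRouter exposes an OpenAI-compatible endpoint, so the stock
    # openai client works against it with a different base_url.
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="google/gemini-2.5-pro",  # illustrative slug; check openrouter.ai/models
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)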
Sonnet 4 is extremely good for tool-usage agentic setups though - something I have found other models struggle with over a long context.
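For context, a single tool-usage turn looks roughly like this. A minimal sketch, assuming the anthropic Python SDK and an ANTHROPIC_API_KEY env var; the model id and the get_weather tool are made up for illustration:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # A hypothetical tool the model may decide to call.
    tools = [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }]

    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed id; check Anthropic's model list
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    )

    # If the model chose to call the tool, the reply contains a tool_use block
    # whose input the agent loop executes before sending a tool_result back.
    for block in resp.content:
        if block.type == "tool_use":
            print(block.name, block.input)

An agent repeats this loop, appending tool results to the conversation each turn, which is exactly where the long-context tool use gets stressed.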
===== LiveCodeBench
E4B IT: 13.2
Qwen: 55.2
===== AIME25
E4B IT: 11.6
Qwen: 81.3
The new Qwen3 model is not out yet.