It was free for a long time, which usually skews the statistics. The same was true of grok-code-fast-1.
If you haven't heard of it yet, there's some good discussion here: https://news.ycombinator.com/item?id=47069179
- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base
- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...
I'm not aware of other AI labs that have released base checkpoints for models in this size class. Qwen released some base models for its 3.5 series, but the biggest one is the 35B checkpoint.
They also released the entire training pipeline:
- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...
skysniper•1h ago
The two boards look nothing alike. Top 3 performance: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Top 3 cost-effectiveness: StepFun 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.
The most dramatic split: Claude Opus 4.6 is #1 on performance but #14 on cost-effectiveness. StepFun 3.5 Flash is #1 cost-effectiveness, #5 performance.
Other surprises: GLM-5 Turbo, Xiaomi MiMo v2 Pro, and MiniMax M2.7 all outrank Gemini 3.1 Pro on performance.
Rankings use relative ordering only (not raw scores) fed into a grouped Plackett-Luce model with bootstrap CIs. Same principle as Chatbot Arena — absolute scores are noisy, but "A beat B" is reliable. Full methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
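For a rough sense of what fitting strengths from relative orderings looks like, here is a minimal sketch. It is not the Arena's actual code: it fits a plain (ungrouped) Plackett-Luce model by maximum likelihood and bootstraps CIs by resampling battles, and the model names, example rankings, and helper functions are all made up for illustration.

```python
# Minimal Plackett-Luce sketch: fit per-model strengths from per-battle
# orderings, then bootstrap confidence intervals by resampling battles.
import numpy as np
from scipy.optimize import minimize

models = ["opus-4.6", "gpt-5.4", "stepfun-3.5-flash", "minimax-m2.7"]
# Each battle contributes one ranking, best model first (indices into `models`).
rankings = [[0, 2, 1], [2, 0, 3], [0, 1, 3], [2, 3, 1], [0, 2, 3]]

def neg_log_likelihood(theta, rankings):
    # Plackett-Luce: P(ranking) = prod_k exp(theta[r_k]) / sum_{j>=k} exp(theta[r_j])
    nll = 0.0
    for r in rankings:
        s = theta[np.array(r)]
        for k in range(len(r) - 1):
            nll -= s[k] - np.log(np.exp(s[k:]).sum())
    return nll

def fit(rankings, n_models):
    res = minimize(neg_log_likelihood, np.zeros(n_models),
                   args=(rankings,), method="L-BFGS-B")
    # Strengths are identifiable only up to an additive constant, so center them.
    return res.x - res.x.mean()

theta_hat = fit(rankings, len(models))

# Bootstrap: resample battles with replacement and refit to get CIs on strengths.
rng = np.random.default_rng(0)
boot = np.array([
    fit([rankings[i] for i in rng.integers(len(rankings), size=len(rankings))], len(models))
    for _ in range(200)
])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
for m, t, l, h in zip(models, theta_hat, lo, hi):
    print(f"{m:>20s}  strength={t:+.2f}  95% CI [{l:+.2f}, {h:+.2f}]")
```

The centering step is why only relative ordering matters: absolute judge scores never enter the likelihood, only who finished ahead of whom in each battle.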
I built this as part of OpenClaw Arena — submit any task, pick 2-5 models, a judge agent evaluates in a fresh VM. Public benchmarks are free.
refulgentis•1h ago
skysniper•1h ago
Gemini is very unreliable at using skills; it often just reads the skills and then decides to do nothing.
StepFun leads the cost-effectiveness leaderboard.
Rankings really depend on the task, so it's better to try your own.
refulgentis•1h ago
skysniper•1h ago
refulgentis•29m ago
skysniper•13m ago
Yes, the judge is one of Opus 4.6, GPT 5.4, or Gemini 3.1 Pro (the submitter can choose). Self-judged battles (where the judge model is also one of the participants) are excluded when computing rankings.
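To make that exclusion rule concrete, here is a tiny sketch with hypothetical battle records (not the Arena's data model): battles whose judge also appears among the participants are simply dropped before ranking.

```python
# Hypothetical battle records; keep only battles where the judge model
# is not itself one of the participants.
battles = [
    {"judge": "opus-4.6", "participants": ["gpt-5.4", "stepfun-3.5-flash"]},
    {"judge": "gpt-5.4",  "participants": ["gpt-5.4", "minimax-m2.7"]},  # self-judged, excluded
]
eligible = [b for b in battles if b["judge"] not in b["participants"]]
```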
> There's a lot of references to "just like LMArena", but LMArena is human evaluated?
Yeah, LMArena is human evaluated, but here I found it impractical to gather enough human evaluation data, because the effort it takes to compare the results is much higher:
- for code, the judge needs to read through it to check code quality, and actually run it to see the output
- when the task produces a webpage or a document, the judge needs to check the content and layout visually
- when anything goes wrong, the judge needs to read the execution log to see whether partial credit should be granted
If you look at the cost details of each battle (available at the bottom of the battle detail page), the judge typically costs more than any participant model.
If we evaluated with humans, I would say each evaluation could easily take ~5-10 min.
refulgentis•9m ago
Thanks for replying btw, didn't mean any disrespect, good on you for not getting aggro about feedback
rat9988•54m ago
refulgentis•31m ago
Maybe? :)
> There are many others that are okay with it
Correct.
> and it doesn't diminish the quality of the work.
It does affect people who are just now hearing about the work.
I applaud your instinct to defend someone who put in effort. It's one of the most important things we can do.
Another important thing we can do for them is to be honest about our own reactions. That's not sunshine and rainbows on its face, but it is generous, mostly because A) it takes time and B) other people might see red and harangue you for it.
johndough•6m ago