The problem: Chatbot Arena tests conversation quality. But most people using AI agents need them to do more: browse the web, manage files, write and run code, create full applications, automate multi-step workflows. There's no benchmark that (1) tests general-purpose agentic tasks, (2) uses user-submitted tasks instead of fixed test sets, and (3) separately ranks models on both quality and cost-effectiveness.
What we built: OpenClaw Arena lets you submit any task and pit 2-5 models against each other. A judge OpenClaw agent (currently using one of the top models: Claude Opus 4.6, GPT-5.4, or Gemini 3.1 Pro) runs on a fresh VM, spawns one subagent per model, and each model solves the task independently with full access to terminal, browser, file system, and code execution.
Results feed into two live leaderboards:
- Performance — which model produces the best results
- Cost-effectiveness — which model delivers the best quality per dollar
What we've found (after 300+ battles, 15 models):
The two rankings are completely different. Performance top 3: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Cost-effectiveness top 3: Step 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.
Claude Opus 4.6 ranks #1 on performance but #14 on cost-effectiveness.
Step 3.5 Flash is #1 on cost-effectiveness, #5 on performance. (I didn't expect that TBH)
Several models (GLM-5 Turbo, Xiaomi MiMo v2 Pro, MiniMax M2.7) outrank Gemini 3.1 Pro on performance. Gemini 3.1 Pro is actually so bad at using skills that we had to tune the judge message just for it; otherwise it sometimes just reads the skill and then decides to do nothing...
Note: we bootstrapped the first 300 battles by crawling what people are doing with OpenClaw (on X, Reddit, etc.) and generating battles with similar tasks and randomly selected models.
Methodology: We only use the relative ordering of models within each battle to compute rankings — not the raw scores. Same principle as Chatbot Arena: absolute scores from judges are noisy and poorly calibrated (a "7/10" in one battle might be "6/10" in another), but "A ranked above B" is much more consistent and reliable. Rankings use a grouped Plackett-Luce model (not simple win-rate or Bradley-Terry) with 1,000-resample bootstrap confidence intervals. Each model entry shows score ± CI and a rank spread (plausible rank range). Models with insufficient data are marked "provisional." Full methodology with equations: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
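To make the ranking idea concrete, here is a minimal sketch of how per-battle orderings can be turned into global scores with a Plackett-Luce fit plus bootstrap rank spreads. This is an illustrative toy (Hunter's MM algorithm, plain Python, made-up function names), not our production code; see the methodology page for the actual grouped model and equations.

```python
import random
from collections import defaultdict

def fit_plackett_luce(rankings, iters=200):
    """Fit Plackett-Luce strengths from battle orderings (best -> worst)
    via Hunter's MM algorithm. Only relative order matters, never raw scores."""
    items = {m for r in rankings for m in r}
    w = {m: 1.0 for m in items}
    # wins[m] = number of battles where m finished above at least one other model
    wins = defaultdict(int)
    for r in rankings:
        for m in r[:-1]:
            wins[m] += 1
    for _ in range(iters):
        denom = defaultdict(float)
        for r in rankings:
            # each "stage" t is a choice of winner among the models still unranked
            for t in range(len(r) - 1):
                total = sum(w[m] for m in r[t:])
                for m in r[t:]:
                    denom[m] += 1.0 / total
        w = {m: (wins[m] / denom[m] if denom[m] else 0.0) for m in items}
        z = sum(w.values())  # normalize; PL is invariant to overall scale
        w = {m: v / z for m, v in w.items()}
    return w

def bootstrap_rank_spread(rankings, n_boot=1000, seed=0):
    """Resample whole battles with replacement, refit, and report each
    model's plausible rank range (min, max) across resamples."""
    rng = random.Random(seed)
    spread = defaultdict(list)
    for _ in range(n_boot):
        sample = [rng.choice(rankings) for _ in rankings]
        w = fit_plackett_luce(sample, iters=50)
        for rank, m in enumerate(sorted(w, key=w.get, reverse=True), 1):
            spread[m].append(rank)
    return {m: (min(r), max(r)) for m, r in spread.items()}
```

With noisy pairwise battles like `[["A","B"]]*8 + [["B","A"]]*2 + ...`, the fitted strengths recover A > B > C even though no single battle sees all three models, which is the point of pooling orderings across battles instead of averaging judge scores.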
Key features:
- Live dual leaderboard (performance + cost-effectiveness) with Plackett-Luce ranking
- Dynamic user-submitted tasks across 11 categories (no fixed test set to overfit on); we'll add more categories, so let me know what you'd like to see
- Fresh VM per benchmark with one subagent per model
- User-selectable judge model
- Full conversation history, judge reasoning, and workspace artifacts preserved and shown to users
- Full transparency: you can evaluate the output yourself, not just trust the score
- Open-source judge skill: https://github.com/unifai-network/skills/tree/main/agent-ben...
Public benchmarks are free (we cover compute). The leaderboard is browsable without an account.
- Leaderboard: https://app.uniclaw.ai/arena?via=hn
- Submit a battle: https://app.uniclaw.ai/arena/new?via=hn (free account required)
- Methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
- Judge skill source: https://github.com/unifai-network/skills/tree/main/agent-ben...
We'd love feedback on the methodology and what tasks you'd want to see benchmarked.