The problem: Chatbot Arena tests conversation quality. But most people using AI agents need them to do more: browse the web, manage files, write and run code, create full applications, automate multi-step workflows. There's no benchmark that (1) tests general-purpose agentic tasks, (2) uses user-submitted tasks instead of fixed test sets, and (3) separately ranks models on both quality and cost-effectiveness.
What we built: OpenClaw Arena lets you submit any task and pit 2-5 models against each other. A judge OpenClaw agent (currently using one of the top models: Claude Opus 4.6, GPT-5.4, or Gemini 3.1 Pro) runs on a fresh VM, spawns one subagent per model, and each model solves the task independently with full access to terminal, browser, file system, and code execution.
Results feed into two live leaderboards:
- Performance — which model produces the best results
- Cost-effectiveness — which model delivers the best quality per dollar
What we've found (after 300+ battles, 15 models):
The two rankings are completely different. Performance top 3: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Cost-effectiveness top 3: Step 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.
Claude Opus 4.6 ranks #1 on performance but #14 on cost-effectiveness.
Step 3.5 Flash is #1 on cost-effectiveness, #5 on performance. (I didn't expect that TBH)
Several models (GLM-5 Turbo, Xiaomi MiMo v2 Pro, MiniMax M2.7) outrank Gemini 3.1 Pro on performance. Gemini 3.1 Pro is actually so bad at using skills that we had to tune the judge message just for it; otherwise it sometimes just reads the skill and then decides to do nothing...
Note: we bootstrapped the first 300 battles by crawling what people are doing with OpenClaw (on X, Reddit, etc.) and generating battles with similar tasks and randomly selected models.
Methodology: We only use the relative ordering of models within each battle to compute rankings — not the raw scores. Same principle as Chatbot Arena: absolute scores from judges are noisy and poorly calibrated (a "7/10" in one battle might be "6/10" in another), but "A ranked above B" is much more consistent and reliable. Rankings use a grouped Plackett-Luce model (not simple win-rate or Bradley-Terry) with 1,000-resample bootstrap confidence intervals. Each model entry shows score ± CI and a rank spread (plausible rank range). Models with insufficient data are marked "provisional." Full methodology with equations: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
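To make the ranking idea concrete, here is a minimal sketch of how per-battle orderings can be turned into global scores with a Plackett-Luce fit plus bootstrap rank spreads. This is an illustrative toy (Hunter's MM algorithm, plain Python, made-up function names), not our production code; see the methodology page for the actual grouped model and equations.

```python
import random
from collections import defaultdict

def fit_plackett_luce(rankings, iters=200):
    """Fit Plackett-Luce strengths from battle orderings (best -> worst)
    via Hunter's MM algorithm. Only relative order matters, never raw scores."""
    items = {m for r in rankings for m in r}
    w = {m: 1.0 for m in items}
    # wins[m] = number of battles where m finished above at least one other model
    wins = defaultdict(int)
    for r in rankings:
        for m in r[:-1]:
            wins[m] += 1
    for _ in range(iters):
        denom = defaultdict(float)
        for r in rankings:
            # each "stage" t is a choice of winner among the models still unranked
            for t in range(len(r) - 1):
                total = sum(w[m] for m in r[t:])
                for m in r[t:]:
                    denom[m] += 1.0 / total
        w = {m: (wins[m] / denom[m] if denom[m] else 0.0) for m in items}
        z = sum(w.values())  # normalize; PL is invariant to overall scale
        w = {m: v / z for m, v in w.items()}
    return w

def bootstrap_rank_spread(rankings, n_boot=1000, seed=0):
    """Resample whole battles with replacement, refit, and report each
    model's plausible rank range (min, max) across resamples."""
    rng = random.Random(seed)
    spread = defaultdict(list)
    for _ in range(n_boot):
        sample = [rng.choice(rankings) for _ in rankings]
        w = fit_plackett_luce(sample, iters=50)
        for rank, m in enumerate(sorted(w, key=w.get, reverse=True), 1):
            spread[m].append(rank)
    return {m: (min(r), max(r)) for m, r in spread.items()}
```

With noisy pairwise battles like `[["A","B"]]*8 + [["B","A"]]*2 + ...`, the fitted strengths recover A > B > C even though no single battle sees all three models, which is the point of pooling orderings across battles instead of averaging judge scores.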
Key features:
- Live dual leaderboard (performance + cost-effectiveness) with Plackett-Luce ranking
- Dynamic user-submitted tasks across 11 categories (no fixed test set to overfit on); we'll add more categories, so let me know what you'd like to see
- Fresh VM per benchmark with one subagent per model
- User-selectable judge model
- Full conversation history, judge reasoning, and workspace artifacts preserved and shown to users
- Full transparency: you can evaluate the output yourself, not just trust the score
- Open-source judge skill: https://github.com/unifai-network/skills/tree/main/agent-ben...
Public benchmarks are free (we cover compute). The leaderboard is browsable without an account.
- Leaderboard: https://app.uniclaw.ai/arena?via=hn
- Submit a battle: https://app.uniclaw.ai/arena/new?via=hn (free account required)
- Methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
- Judge skill source: https://github.com/unifai-network/skills/tree/main/agent-ben...
We'd love feedback on the methodology and what tasks you'd want to see benchmarked.