frontpage.

Made with ♥ by @iamnishanth

Show HN: OpenClaw Arena – Benchmark models on real tasks, rank by perf and cost

https://app.uniclaw.ai/arena?via=hn
2•skysniper•1h ago
We built an arena for comparing AI models on real agentic tasks, not chat or static benchmarks. Models run as actual OpenClaw subagents in fresh VMs with full tool access, and results feed into two separate leaderboards: performance and cost-effectiveness.

The problem: Chatbot Arena tests conversation quality. But most people using AI agents need them to do more: browse the web, manage files, write and run code, create full applications, automate multi-step workflows. There's no benchmark that (1) tests general-purpose agentic tasks, (2) uses user-submitted tasks instead of fixed test sets, and (3) separately ranks models on both quality and cost-effectiveness.

What we built: OpenClaw Arena lets you submit any task and pit 2-5 models against each other. A judge OpenClaw agent (currently using one of the top models: Claude Opus 4.6, GPT-5.4, or Gemini 3.1 Pro) runs on a fresh VM, spawns one subagent per model, and each model solves the task independently with full access to terminal, browser, file system, and code execution.

Results feed into two live leaderboards:

- Performance — which model produces the best results

- Cost-effectiveness — which model delivers the best quality per dollar
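To illustrate how one battle can yield two different orderings, here is a minimal sketch. It assumes each model's run ends with a judge score and a dollar cost, and that "quality per dollar" is simply score divided by cost. The `Result` fields, the `battle_orderings` helper, and the ratio itself are assumptions made for illustration; the linked methodology page defines the actual computation.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    judge_score: float  # hypothetical: judge's score for this model in one battle
    cost_usd: float     # hypothetical: total spend for this model's run


def battle_orderings(results):
    """Return (performance order, cost-effectiveness order), best model first."""
    by_perf = sorted(results, key=lambda r: r.judge_score, reverse=True)
    # Assumed definition of "quality per dollar"; guard against a zero cost.
    by_value = sorted(
        results,
        key=lambda r: r.judge_score / max(r.cost_usd, 1e-9),
        reverse=True,
    )
    return [r.model for r in by_perf], [r.model for r in by_value]
```

Only the two orderings would be passed on to the ranking step; the raw scores and costs are discarded once the order is known.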

What we've found (after 300+ battles, 15 models):

The two rankings are completely different. Performance top 3: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6. Cost-effectiveness top 3: Step 3.5 Flash, Grok 4.1 Fast, MiniMax M2.7.

Claude Opus 4.6 ranks #1 on performance but #14 on cost-effectiveness.

Step 3.5 Flash is #1 on cost-effectiveness, #5 on performance. (I didn't expect that TBH)

Several models (GLM-5 Turbo, Xiaomi MiMo v2 Pro, MiniMax M2.7) outrank Gemini 3.1 Pro on performance. In fact, Gemini 3.1 Pro is so bad at using skills that we had to tune the judge message specifically for it; otherwise it sometimes just reads the skill and decides to do nothing.

Note: we bootstrapped the first 300 battles by crawling what people are doing with OpenClaw (on X, Reddit, etc.) and generating battles with similar tasks and randomly selected models.

Methodology: We only use the relative ordering of models within each battle to compute rankings — not the raw scores. Same principle as Chatbot Arena: absolute scores from judges are noisy and poorly calibrated (a "7/10" in one battle might be "6/10" in another), but "A ranked above B" is much more consistent and reliable. Rankings use a grouped Plackett-Luce model (not simple win-rate or Bradley-Terry) with 1,000-resample bootstrap confidence intervals. Each model entry shows score ± CI and a rank spread (plausible rank range). Models with insufficient data are marked "provisional." Full methodology with equations: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn
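For readers who want a feel for the ranking step, here is a rough sketch of fitting a plain Plackett-Luce model to per-battle orderings with a standard minorize-maximize update, plus a percentile bootstrap over battles. It is not the grouped variant described above, and the function names, the `smooth` term (a small ad-hoc regularizer that keeps strengths finite when a model never loses or never wins), and the iteration counts are all assumptions for illustration.

```python
import math
import random
from collections import defaultdict


def fit_plackett_luce(rankings, n_iter=200, smooth=0.1):
    """rankings: list of lists of model names, best first, no repeats.

    Returns {model: log-strength}, centered so the values average to zero.
    """
    models = sorted({m for r in rankings for m in r})
    w = {m: 1.0 for m in models}
    # A model "wins" a battle stage whenever it is not ranked last.
    wins = defaultdict(int)
    for r in rankings:
        for m in r[:-1]:
            wins[m] += 1
    for _ in range(n_iter):
        denom = defaultdict(float)
        for r in rankings:
            remaining = sum(w[m] for m in r)
            for j in range(len(r) - 1):
                # Stage j picks the best of r[j:]; each candidate shares 1/sum.
                inv = 1.0 / remaining
                for m in r[j:]:
                    denom[m] += inv
                remaining -= w[r[j]]
        # MM update with ad-hoc smoothing (an assumption of this sketch).
        w = {m: (wins[m] + smooth) / (denom[m] + smooth) for m in models}
    logs = {m: math.log(w[m]) for m in models}
    mean = sum(logs.values()) / len(logs)
    return {m: v - mean for m, v in logs.items()}


def bootstrap_ci(rankings, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval per model, resampling whole battles."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_resamples):
        resample = [rng.choice(rankings) for _ in rankings]
        for m, s in fit_plackett_luce(resample).items():
            samples[m].append(s)
    ci = {}
    for m, vals in samples.items():
        vals.sort()
        lo = vals[int((alpha / 2) * (len(vals) - 1))]
        hi = vals[int((1 - alpha / 2) * (len(vals) - 1))]
        ci[m] = (lo, hi)
    return ci
```

Each battle contributes only its ordering, so a 2-model battle and a 5-model battle feed the same fit. Wide intervals from `bootstrap_ci` would correspond to the entries the leaderboard marks as provisional.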

Key features:

- Live dual leaderboard (performance + cost-effectiveness) with Plackett-Luce ranking

- Dynamic user-submitted tasks across 11 categories (no fixed test set to overfit on). We will add more; just let me know what you want to add

- Fresh VM per benchmark with one subagent per model

- User-selectable judge model

- Full conversation history, judge reasoning, and workspace artifacts preserved and shown to users

- Full transparency: you can evaluate the output yourself, not just trust the score

- Open-source judge skill: https://github.com/unifai-network/skills/tree/main/agent-ben...

Public benchmarks are free (we cover compute). The leaderboard is browsable without an account.

- Leaderboard: https://app.uniclaw.ai/arena?via=hn

- Submit a battle: https://app.uniclaw.ai/arena/new?via=hn (free account required)

- Methodology: https://app.uniclaw.ai/arena/leaderboard/methodology?via=hn

- Judge skill source: https://github.com/unifai-network/skills/tree/main/agent-ben...

We'd love feedback on the methodology and what tasks you'd want to see benchmarked.

Ask HN: What is your favorite quote from a book?

1•chistev•30s ago•0 comments

GNU Parallel citation request now asks you cite "Epstein files"

https://cgit.git.savannah.gnu.org/cgit/parallel.git/tree/src/parallel#n6118
1•LeoPanthera•34s ago•0 comments

Does AI work feel a bit too habit-forming?

https://thedayninja.substack.com/p/why-im-paying-attention-to-ai-overload
1•emotf•1m ago•1 comments

Agentic AI Engineering Workflows for iOS in 2026

https://blog.jacobstechtavern.com/p/agentic-ai-2026
1•jakey_bakey•2m ago•0 comments

Navigating the POC Valley

https://deploy95.substack.com/p/navigating-the-poc-valley
1•dddddaviddddd•4m ago•0 comments

Google Gemini may adapt AI answers to match user tone: Report

https://searchengineland.com/google-gemini-tone-emotions-report-473118
1•semking•4m ago•1 comments

Euro Office

https://github.com/Euro-Office
1•esher•4m ago•0 comments

Full-Scale Ultrasonic Clothes Washer/Dryer Evaluation Results (2025)[pdf]

https://ttu-ir.tdl.org/server/api/core/bitstreams/2a8b4b32-781b-4fbc-a769-fdf861f92c93/content
1•kelseyfrog•4m ago•0 comments

Show HN: Ironedome Commander – Israel/Iran War Arcade

https://irondomecommander.com
1•cassiepaper•5m ago•0 comments

Italy, Spain set solar records in March

https://www.pv-magazine.com/2026/03/31/italy-spain-set-solar-records-in-march/
1•vrganj•6m ago•1 comments

Sauver: An open-source, local AI agent that fights email slop

https://sauver.org
1•mszczodrak•7m ago•1 comments

Show HN: The 42-Day Vibe – A mockumentary on "Vibe Coding" taken to the extreme

https://open.spotify.com/episode/6RuxZkgUxXXjrT66FlXeme
1•codekidX•7m ago•0 comments

Show HN: Multi-agent autoresearch for ANE inference beats Apple's CoreML by 6×

https://www.ensue-network.ai/lab/ane
3•christinetyip•7m ago•0 comments

Jehovah's Witnesses Sue Museum for Archive of Nazi-Era Abuses – NYT

https://www.nytimes.com/2022/01/25/arts/design/jehovahs-witnesses-nazis-lawsuit-museum.html
2•janandonly•8m ago•0 comments

LiteLLM post-mortem: when the scanner runs the attack

https://www.bluerock.io/post/litellm-supply-chain-protection
1•BlueRock-Jake•9m ago•0 comments

United Airlines Orion Design System

https://www.onenorth.com/work/united-airlines/
1•skogstokig•12m ago•0 comments

Show HN: Hey PMs, let's give you a fighting chance!

https://usecentel.com/
3•marcel-felix•12m ago•0 comments

We Have a Shortlist for Finding Life Beyond Earth

https://www.seti.org/news/we-finally-have-a-shortlist-for-finding-life-beyond-earth/
2•u1hcw9nx•13m ago•0 comments

Casio the Special One Calculator

https://www.youtube.com/watch?v=YTwESBcgoyQ
1•skogstokig•15m ago•0 comments

Show HN: Cerno – CAPTCHA that targets LLM reasoning, not human biology

https://cerno.sh
2•plawlost•15m ago•0 comments

Law Firms Prefer Cubicles to Cubicle Dwellers

https://b2bs.substack.com/p/law-firms-prefer-cubicles-to-cubicle
1•oopsiremembered•18m ago•0 comments

Encoding Team Standards

https://martinfowler.com/articles/reduce-friction-ai/encoding-team-standards.html
1•saikatsg•18m ago•0 comments

Show HN: ModelAtlas – Find AI models that HuggingFace search can't

https://github.com/rohanvinaik/ModelAtlas
1•Vinaik•18m ago•1 comments

Noyb win: Microsoft ordered to stop tracking school children

https://noyb.eu/en/noyb-win-microsoft-ordered-stop-tracking-school-children
3•jruohonen•19m ago•0 comments

Kevin Rose Back at Digg

https://www.kevinrose.com/p/rebooting-everything
2•minkeymaniac•20m ago•0 comments

Jami – free/libre, end-to-end encrypted, and private communication software

https://jami.net/
2•smartmic•21m ago•0 comments

Show HN: I adapted codex-plugin-cc's design for Gemini CLI's ACP

1•abiswas97•22m ago•0 comments

Major Claude Code source leak offers deep insight into how Anthropic tool works

https://arstechnica.com/ai/2026/03/entire-claude-code-cli-source-code-leaks-thanks-to-exposed-map...
4•johnbarron•24m ago•0 comments

Why Inventing Color TV Was So Difficult [video]

https://www.youtube.com/watch?v=hyjCmIbRRvs
1•DamnInteresting•26m ago•0 comments

After 16 years and $8B, military new GPS software still doesn't work

https://arstechnica.com/space/2026/03/after-16-years-and-8-billion-the-militarys-new-gps-software...
3•johnbarron•26m ago•1 comments