A couple of days ago I launched AIBenchy — a small, opinionated leaderboard running my own custom tests, focused on end-user/dev scenarios that actually trip up models today.
Current tests cover categories like:
- Anti-AI Tricks (classic gotchas like "count the Rs in strawberry", logic traps)
- Instruction following & consistency
- Data parsing/extraction
- Domain-specific tasks
- Puzzle solving / edge-case reasoning
Recent additions (just pushed today):
- Reasoning score (new!): A separate judge LLM evaluates the chain-of-thought for efficiency. Does it repeat itself, loop, think forever, brute-force enumerate every possibility (looking at you, some Qwen3.5 runs), or get to the point cleanly? This penalizes "cheaty" high-token reasoning even when the final answer is correct. Goal: reward smart, concise thinking over exhaustive trial-and-error. (Rough sketch of the judge setup after this list.)
- Stability metric: Measures consistency across runs — some models flake on the exact same prompt. (Also sketched below.)
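
To make the reasoning score concrete, here's a minimal sketch of the LLM-as-judge idea, assuming an OpenAI-compatible client. The judge model name, rubric wording, and 0–10 scale are placeholders I picked for illustration, not AIBenchy's actual setup:

```python
# Minimal sketch of an LLM-as-judge reasoning-efficiency score.
# Assumes an OpenAI-compatible endpoint; the judge model name, rubric
# wording, and 0-10 scale are placeholders, not AIBenchy's real config.
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "You are grading the efficiency of a model's chain-of-thought.\n"
    "Penalize repetition, loops, and brute-force enumeration of every case.\n"
    "Reward concise reasoning that goes straight to the answer.\n"
    "Reply with a single integer from 0 (wasteful) to 10 (clean and direct)."
)

def reasoning_score(chain_of_thought: str, judge_model: str = "gpt-4o-mini") -> int:
    """Ask a separate judge model to rate CoT efficiency, independent of correctness."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": chain_of_thought},
        ],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```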
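And the stability metric boils down to "ask the same thing N times, see how often the model agrees with itself." A rough sketch, where `ask_model` stands in for whatever call produces an answer string:

```python
# Sketch of a stability metric: run the same prompt several times and
# report the fraction of runs matching the majority answer (1.0 = stable).
# `ask_model` is a hypothetical stand-in for the actual model call.
from collections import Counter
from typing import Callable

def stability(ask_model: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    answers = [ask_model(prompt).strip().lower() for _ in range(runs)]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / runs
```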
Right now the leaderboard has ~20 models (Qwen3.5 Plus currently topping it, followed by GLM 5, various GPT/Claude variants, etc.), but it's super early/WIP:
- Manual runs + small test set
- No public submission of tests yet (open to ideas!)
- Focused on transparency & practical usefulness over massive scale
I'd love feedback from HN:
- What custom tests / gotchas / use-cases should I add next?
- Thoughts on the reasoning score — fair way to judge efficiency, or too subjective?
- Models/variants I'm missing (especially fast/cheap ones ignored elsewhere)?
- Should I let people submit their own prompts/tests eventually?
Thanks for checking it out: https://aibenchy.com
Appreciate any roast/ideas — building this to scratch my own itch.