A couple of days ago I launched AIBenchy — a small, opinionated leaderboard running my own custom tests, focused on end-user/dev scenarios that actually trip up models today.
Current tests cover categories like:
- Anti-AI Tricks (classic gotchas like "count the Rs in strawberry", logic traps)
- Instruction following & consistency
- Data parsing/extraction
- Domain-specific tasks
- Puzzle solving / edge-case reasoning
Recent additions (just pushed today):
- Reasoning score (new!): A separate judge LLM evaluates the chain-of-thought for efficiency. Does it repeat itself, loop, think forever, brute-force enumerate every possibility (looking at you, some Qwen3.5 runs), or get to the point cleanly? This penalizes "cheaty" high-token reasoning even when the final answer is correct. Goal: reward smart, concise thinking over exhaustive trial-and-error. (Rough sketch of the judge setup after this list.)
- Stability metric: Measures consistency across runs — some models flake on the exact same prompt. (Also sketched below.)
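
To make the reasoning score concrete, here's a minimal sketch of the LLM-as-judge idea, assuming an OpenAI-compatible client. The judge model name, rubric wording, and 0–10 scale are placeholders I picked for illustration, not AIBenchy's actual setup:

```python
# Minimal sketch of an LLM-as-judge reasoning-efficiency score.
# Assumes an OpenAI-compatible endpoint; the judge model name, rubric
# wording, and 0-10 scale are placeholders, not AIBenchy's real config.
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "You are grading the efficiency of a model's chain-of-thought.\n"
    "Penalize repetition, loops, and brute-force enumeration of every case.\n"
    "Reward concise reasoning that goes straight to the answer.\n"
    "Reply with a single integer from 0 (wasteful) to 10 (clean and direct)."
)

def reasoning_score(chain_of_thought: str, judge_model: str = "gpt-4o-mini") -> int:
    """Ask a separate judge model to rate CoT efficiency, independent of correctness."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": chain_of_thought},
        ],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```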
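And the stability metric boils down to "ask the same thing N times, see how often the model agrees with itself." A rough sketch, where `ask_model` stands in for whatever call produces an answer string:

```python
# Sketch of a stability metric: run the same prompt several times and
# report the fraction of runs matching the majority answer (1.0 = stable).
# `ask_model` is a hypothetical stand-in for the actual model call.
from collections import Counter
from typing import Callable

def stability(ask_model: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    answers = [ask_model(prompt).strip().lower() for _ in range(runs)]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / runs
```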
Right now the leaderboard has ~20 models (Qwen3.5 Plus currently topping it, followed by GLM 5, various GPT/Claude variants, etc.), but it's super early/WIP:
- Manual runs + small test set
- No public submission of tests yet (open to ideas!)
- Focused on transparency & practical usefulness over massive scale
I'd love feedback from HN:
- What custom tests / gotchas / use-cases should I add next?
- Thoughts on the reasoning score — fair way to judge efficiency, or too subjective?
- Models/variants I'm missing (especially fast/cheap ones ignored elsewhere)?
- Should I let people submit their own prompts/tests eventually?
Thanks for checking it out: https://aibenchy.com
Appreciate any roast/ideas — building this to scratch my own itch.