The problem: Existing benchmarks use synthetic problems. I wanted to know which LLM is best at MY actual code challenges.
How it works:
• Submit code + describe your task ("refactor this", "find security issues", etc.)
• Six models solve it in parallel: GPT-5, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, Gemini 2.5 Pro, o3 (see the sketch after this list)
• An AI judge scores each solution on correctness, security, performance, etc.
• You vote on the real winner
• A public leaderboard shows which models actually win on real-world tasks
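
For the curious, the pipeline is essentially fan-out/fan-in. Here's a minimal Python sketch of that shape; `call_model`, `judge`, and the model ID strings are placeholders of mine, not the platform's actual backend:

```python
import asyncio

# Illustrative stand-ins, not the real backend identifiers
MODELS = ["gpt-5", "claude-opus-4.1", "claude-sonnet-4.5",
          "grok-4", "gemini-2.5-pro", "o3"]
RUBRIC = ("correctness", "security", "performance")

async def call_model(model_id: str, prompt: str) -> str:
    """Hypothetical provider call; a real version would hit each vendor's API."""
    await asyncio.sleep(0.1)  # simulated network latency
    return f"[{model_id}] solution for: {prompt[:30]}..."

async def judge(task: str, solution: str) -> dict:
    """Hypothetical AI judge: one extra model call that scores a solution
    on each rubric dimension and returns the parsed numbers."""
    await asyncio.sleep(0.1)
    return {dim: 0.0 for dim in RUBRIC}  # placeholder scores

async def evaluate(code: str, task: str) -> dict:
    prompt = f"Task: {task}\n\nCode:\n{code}"
    # Fan out: all six models get the same prompt concurrently
    solutions = await asyncio.gather(*(call_model(m, prompt) for m in MODELS))
    # Fan in: judge every solution against the rubric
    scores = await asyncio.gather(*(judge(task, s) for s in solutions))
    return dict(zip(MODELS, scores))

if __name__ == "__main__":
    print(asyncio.run(evaluate("def f(x): return x * x", "refactor this")))
```

Running the six models concurrently means an evaluation takes roughly as long as the slowest single model, not the sum of all six.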
10 evaluations are live so far, with a 100% vote completion rate. Early patterns are emerging:
• GPT-5 leads overall with a 40% win rate (4/10 wins)
• Gemini 2.5 Pro dominates security tasks
• GPT-5 is strongest at refactoring
• Claude Sonnet 4.5 leads on optimization tasks
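
The leaderboard math behind numbers like these is just a grouped tally of vote outcomes. A toy sketch with made-up records (not the real vote data):

```python
from collections import Counter, defaultdict

# Made-up (winner, category) vote records for illustration only
votes = [
    ("GPT-5", "refactoring"),
    ("Gemini 2.5 Pro", "security"),
    ("Claude Sonnet 4.5", "optimization"),
    ("GPT-5", "general"),
    ("o3", "general"),
]

overall = Counter(model for model, _ in votes)
per_category = defaultdict(Counter)
for model, category in votes:
    per_category[category][model] += 1

for model, wins in overall.most_common():
    print(f"{model}: {wins}/{len(votes)} wins ({wins / len(votes):.0%})")
for category, tally in per_category.items():
    leader, wins = tally.most_common(1)[0]
    print(f"{category}: {leader} leads with {wins} win(s)")
```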
A queue system keeps costs predictable: a $10/day budget funds 15 free evaluations for the community (minimal sketch below).
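
Roughly how the cap works: each evaluation has a known worst-case cost, and submissions are admitted only while the day's spend stays under budget. A hypothetical sketch (the ~$0.67/eval figure is just $10 / 15; the class and names are mine, not the real code):

```python
import asyncio

DAILY_BUDGET_USD = 10.0
COST_PER_EVAL_USD = DAILY_BUDGET_USD / 15  # ~$0.67 across six models + judge (assumed)

class EvalQueue:
    """Hypothetical admission control: accept jobs only while today's
    spend stays under the daily budget."""

    def __init__(self) -> None:
        self.spent_today = 0.0
        self.pending: asyncio.Queue[str] = asyncio.Queue()

    async def submit(self, job: str) -> bool:
        if self.spent_today + COST_PER_EVAL_USD > DAILY_BUDGET_USD:
            return False  # over budget: ask the user to try again tomorrow
        self.spent_today += COST_PER_EVAL_USD
        await self.pending.put(job)
        return True

    def reset_day(self) -> None:
        self.spent_today = 0.0  # a daily cron/timer would call this

async def main() -> None:
    q = EvalQueue()
    accepted = [await q.submit(f"eval-{i}") for i in range(16)]
    print(sum(accepted), "of 16 accepted")  # 15: the 16th waits for the reset

asyncio.run(main())
```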
Free during beta - would love your feedback!