How it works:
- Upload code + describe the task (refactoring, security review, architecture, etc.)
- All 6 models run in parallel (~2-5 min)
- See side-by-side comparison with AI judge scores
- Community votes on winners (blind voting)
- Each evaluation feeds into the overall AI model leaderboard, surfacing which models perform best (rough sketch of the pipeline below)
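For anyone curious about the tech side, here's a rough TypeScript sketch of what the parallel fan-out plus AI-judge step could look like. To be clear, `callModel`, `judgeSubmission`, and the model list are hypothetical stand-ins, not the actual CodeLens implementation:

```typescript
// Hypothetical sketch of the evaluation fan-out and judging step.
// callModel() and judgeSubmission() are stand-in stubs, not CodeLens internals.

type ModelId = string;

interface Submission {
  model: ModelId;
  output: string; // the model's proposed refactor / review / design
}

interface JudgedSubmission extends Submission {
  judgeScore: number; // 0-10 score assigned by the AI judge
}

const MODELS: ModelId[] = ["model-a", "model-b", "model-c", "model-d", "model-e", "model-f"];

// Stub: in reality this would call each provider's API.
async function callModel(model: ModelId, code: string, task: string): Promise<Submission> {
  return { model, output: `// ${model}'s answer to "${task}" for ${code.length} bytes of code` };
}

// Stub: in reality a separate judge model scores each output against the task.
async function judgeSubmission(task: string, s: Submission): Promise<number> {
  return Math.round(Math.random() * 10);
}

export async function runEvaluation(code: string, task: string): Promise<JudgedSubmission[]> {
  // All six models run in parallel, so wall time is roughly the slowest model.
  const submissions = await Promise.all(MODELS.map((m) => callModel(m, code, task)));

  // Judge each submission independently before the blind community vote.
  return Promise.all(
    submissions.map(async (s) => ({ ...s, judgeScore: await judgeSubmission(task, s) })),
  );
}
```

Running the calls with Promise.all is why the whole run stays in the ~2-5 min range instead of 6x a single model's latency.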
Why I built this: Existing benchmarks (HumanEval, SWE-Bench) don't reflect real-world developer tasks. I wanted to know which model actually solves MY specific problems - refactoring legacy TypeScript, reviewing React components, etc. The idea is similar to LMArena, but their evaluation process isn't fully transparent.
Current status:
- Live at https://codelens.ai
- 23 evaluations so far (small sample, I know!)
- Free tier processes 3 evals per day (first-come, first-served queue; see the sketch after this list)
- Looking for real tasks to make the benchmark meaningful
- Happy to answer questions about the tech stack, cost structure, or methodology.
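To make the free-tier limit concrete, here's a purely illustrative first-come, first-served queue with a daily cap of 3. It assumes a simple in-memory queue and daily counter; the real setup may differ:

```typescript
// Illustrative only: a FIFO queue with a daily cap, roughly how a
// 3-evals-per-day free tier could be enforced. Not the real code.

const DAILY_FREE_LIMIT = 3;

interface QueuedEval {
  id: string;
  submittedAt: Date;
}

class FreeTierQueue {
  private queue: QueuedEval[] = [];
  private processedToday = 0;

  enqueue(evalRequest: QueuedEval): void {
    this.queue.push(evalRequest); // strictly first-come, first-served
  }

  // Called by a worker; returns the next eval, or null once today's quota is used up.
  next(): QueuedEval | null {
    if (this.processedToday >= DAILY_FREE_LIMIT) return null;
    const job = this.queue.shift() ?? null;
    if (job) this.processedToday += 1;
    return job;
  }

  resetDailyQuota(): void {
    this.processedToday = 0; // e.g. triggered at midnight UTC
  }
}
```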
Currently in the validation stage. What are your first impressions?