You describe a task in plain English. The tool generates a test suite for that specific task, discovers candidate models via OpenRouter, benchmarks them in parallel, and uses a Judge LLM to score every response across 5 dimensions: accuracy, hallucination, grounding, tool-calling, and clarity.
Output is a ranked top 3 with average latency per model and a task-specific system prompt optimized for the winner.
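To make the ranking step concrete, here is a minimal sketch of how scores and latencies could be aggregated into a top 3 (the names and data shapes are illustrative, not the repo's actual API; it assumes the judge returns a 0-10 score per dimension per response):

```python
from statistics import mean

# The five judged dimensions from the post; names are assumed.
DIMENSIONS = ("accuracy", "hallucination", "grounding", "tool_calling", "clarity")

def rank_models(results):
    """Rank models by mean judge score and report average latency.

    results: {model_name: {"scores": [ {dim: 0-10, ...}, ... ],
                           "latencies_ms": [ ... ]}}
    Returns the top 3 as a list of dicts, best first.
    """
    ranked = []
    for model, r in results.items():
        # Average the five dimensions per response, then across responses.
        per_response = [mean(s[d] for d in DIMENSIONS) for s in r["scores"]]
        ranked.append({
            "model": model,
            "score": mean(per_response),
            "avg_latency_ms": mean(r["latencies_ms"]),
        })
    ranked.sort(key=lambda x: x["score"], reverse=True)
    return ranked[:3]
```

Reporting latency alongside the score, rather than folding it into one number, leaves the speed/quality tradeoff visible to the user.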
A few things I learned while building it:
- Score and latency rarely correlate. The best model for accuracy on coding tasks was almost never the fastest. This tradeoff is completely task-dependent and impossible to see from benchmarks that don't reflect your workload.
- The Judge LLM approach is surprisingly consistent but introduces positional and familiarity bias. Using one model to score others isn't perfect, but it's far more reproducible than manual eval. Open to ideas on how to reduce judge bias without blowing up the cost.
- Model discovery matters more than I expected. The top performers on generic benchmarks often weren't the top performers on narrow tasks.
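One cheap mitigation for positional bias (not what the repo necessarily does, just a sketch): query the judge twice with the two responses in swapped order, and only count a verdict that survives the swap. The `judge` callable here is a stand-in for whatever wraps the Judge LLM; it takes the prompt plus two responses and answers "first" or "second".

```python
def debiased_pairwise(judge, prompt, resp_a, resp_b):
    """Ask the judge with both orderings; accept a winner only if
    both orderings agree, otherwise report a tie.

    judge(prompt, first, second) -> "first" or "second"
    Returns "a", "b", or "tie".
    """
    v1 = judge(prompt, resp_a, resp_b)  # A shown first
    v2 = judge(prompt, resp_b, resp_a)  # B shown first
    a_wins_shown_first = (v1 == "first")
    a_wins_shown_second = (v2 == "second")
    if a_wins_shown_first and a_wins_shown_second:
        return "a"
    if not a_wins_shown_first and not a_wins_shown_second:
        return "b"
    # The judge flipped with the ordering: treat as positional bias.
    return "tie"
```

This doubles the judge calls per comparison, so it trades cost for reproducibility; a cheaper variant is to randomize the presentation order once per comparison, which removes systematic bias on average without catching it per-pair.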
Stack: Python, OpenRouter for model access, MIT licensed.
https://github.com/gauravvij/llm-evaluator
Happy to answer questions on the design decisions.