But that's completely wrong.
An agent can get the right answer through the wrong path. It can hallucinate in intermediate steps but still reach the correct conclusion. It can violate constraints while technically achieving the goal.
Traditional ML metrics (accuracy, precision, recall) miss all of this because they only look at the final output.
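To make the blind spot concrete, here's a toy example (all of the data is made up): an exact-match check on the final answer calls this run a success, even though one of the intermediate steps cites a source that was never actually consulted.

```python
# Hypothetical trajectory: the agent reaches the right answer, but step 3
# attributes the fact to a "billing API" that was never called.
trajectory = [
    {"role": "assistant", "content": "Calling search('refund policy')..."},
    {"role": "tool", "name": "search", "content": "Refunds within 30 days."},
    {"role": "assistant", "content": "Per the billing API, refunds take 5 days."},  # hallucinated source
    {"role": "assistant", "content": "Answer: refunds are allowed within 30 days."},
]

expected_answer = "refunds are allowed within 30 days"
final_output = trajectory[-1]["content"].lower()

# Output-only accuracy: compare the last message to the expected answer.
accuracy = 1.0 if expected_answer in final_output else 0.0
print(accuracy)  # 1.0 -- the hallucinated intermediate step is invisible to this metric
```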
I've been experimenting with a different approach: treating the agent's system prompt as the ground truth, evaluating the entire trajectory rather than just the final output, and scoring along multiple dimensions instead of a single metric.
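Here's roughly what that looks like in code. This is a minimal sketch, not my actual harness: the message schema, the `allowed_tools` / `max_steps` arguments (stand-ins for constraints that would actually live in the system prompt), and the keyword-overlap "judge" are all placeholders.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryScore:
    task_success: float          # did the final answer satisfy the goal?
    constraint_adherence: float  # were the system prompt's constraints respected?
    groundedness: float          # share of intermediate claims backed by tool output
    efficiency: float            # unused fraction of the step budget

def claim_is_grounded(claim: str, evidence: str) -> bool:
    """Placeholder for an LLM-judge call: a naive keyword-overlap check."""
    keywords = {w.strip(".,'\"()") for w in claim.lower().split() if len(w) > 4}
    return bool(keywords & set(evidence.lower().split()))

def score_trajectory(trajectory, expected_answer, allowed_tools, max_steps):
    """Score a whole agent run on several dimensions instead of one metric."""
    assistant_steps = [m for m in trajectory if m["role"] == "assistant"]
    tool_results = [m for m in trajectory if m["role"] == "tool"]
    evidence = " ".join(m["content"] for m in tool_results)

    # 1. Task success: does the final output contain the expected answer?
    task_success = float(expected_answer.lower() in assistant_steps[-1]["content"].lower())

    # 2. Constraint adherence: only allowed tools used, within the step budget.
    used_tools = {m["name"] for m in tool_results}
    constraint_adherence = float(
        used_tools <= set(allowed_tools) and len(assistant_steps) <= max_steps
    )

    # 3. Groundedness: intermediate claims should be supported by tool output.
    intermediate = assistant_steps[:-1]
    grounded = sum(claim_is_grounded(m["content"], evidence) for m in intermediate)
    groundedness = grounded / len(intermediate) if intermediate else 1.0

    # 4. Efficiency: penalize trajectories that burn most of the step budget.
    efficiency = max(0.0, 1.0 - len(assistant_steps) / max_steps)

    return TrajectoryScore(task_success, constraint_adherence, groundedness, efficiency)

# e.g. score_trajectory(trajectory, "refunds are allowed within 30 days",
#                       allowed_tools=["search"], max_steps=5)
```

The specific heuristics don't matter much; in practice I'd swap the keyword check for an LLM judge that reads the system prompt and the full trajectory. The point is that each failure mode gets its own score instead of everything collapsing into a single pass/fail on the final answer.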
The results are night and day. Suddenly I can see hallucinations, constraint violations, inefficient paths, and consistency issues that traditional metrics completely missed.
Am I crazy? Or is the entire industry evaluating agents wrong?
I'd love to hear from others who are building agents. How are you evaluating them? What problems have you run into?