But that's completely wrong.
An agent can get the right answer through the wrong path. It can hallucinate in intermediate steps but still reach the correct conclusion. It can violate constraints while technically achieving the goal.
Traditional ML metrics (accuracy, precision, recall) miss all of this because they only look at the final output.
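To make the blind spot concrete, here's a toy example (all of the data is made up): an exact-match check on the final answer calls this run a success, even though one of the intermediate steps cites a source that was never actually consulted.

```python
# Hypothetical trajectory: the agent reaches the right answer, but step 3
# attributes the fact to a "billing API" that was never called.
trajectory = [
    {"role": "assistant", "content": "Calling search('refund policy')..."},
    {"role": "tool", "name": "search", "content": "Refunds within 30 days."},
    {"role": "assistant", "content": "Per the billing API, refunds take 5 days."},  # hallucinated source
    {"role": "assistant", "content": "Answer: refunds are allowed within 30 days."},
]

expected_answer = "refunds are allowed within 30 days"
final_output = trajectory[-1]["content"].lower()

# Output-only accuracy: compare the last message to the expected answer.
accuracy = 1.0 if expected_answer in final_output else 0.0
print(accuracy)  # 1.0 -- the hallucinated intermediate step is invisible to this metric
```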
I've been experimenting with a different approach: treating the agent's system prompt as the ground truth, evaluating the entire trajectory rather than just the final output, and scoring along multiple dimensions instead of a single metric.
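Here's roughly what that looks like in code. This is a minimal sketch, not my actual harness: the message schema, the `allowed_tools` / `max_steps` arguments (stand-ins for constraints that would actually live in the system prompt), and the keyword-overlap "judge" are all placeholders.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryScore:
    task_success: float          # did the final answer satisfy the goal?
    constraint_adherence: float  # were the system prompt's constraints respected?
    groundedness: float          # share of intermediate claims backed by tool output
    efficiency: float            # unused fraction of the step budget

def claim_is_grounded(claim: str, evidence: str) -> bool:
    """Placeholder for an LLM-judge call: a naive keyword-overlap check."""
    keywords = {w.strip(".,'\"()") for w in claim.lower().split() if len(w) > 4}
    return bool(keywords & set(evidence.lower().split()))

def score_trajectory(trajectory, expected_answer, allowed_tools, max_steps):
    """Score a whole agent run on several dimensions instead of one metric."""
    assistant_steps = [m for m in trajectory if m["role"] == "assistant"]
    tool_results = [m for m in trajectory if m["role"] == "tool"]
    evidence = " ".join(m["content"] for m in tool_results)

    # 1. Task success: does the final output contain the expected answer?
    task_success = float(expected_answer.lower() in assistant_steps[-1]["content"].lower())

    # 2. Constraint adherence: only allowed tools used, within the step budget.
    used_tools = {m["name"] for m in tool_results}
    constraint_adherence = float(
        used_tools <= set(allowed_tools) and len(assistant_steps) <= max_steps
    )

    # 3. Groundedness: intermediate claims should be supported by tool output.
    intermediate = assistant_steps[:-1]
    grounded = sum(claim_is_grounded(m["content"], evidence) for m in intermediate)
    groundedness = grounded / len(intermediate) if intermediate else 1.0

    # 4. Efficiency: penalize trajectories that burn most of the step budget.
    efficiency = max(0.0, 1.0 - len(assistant_steps) / max_steps)

    return TrajectoryScore(task_success, constraint_adherence, groundedness, efficiency)

# e.g. score_trajectory(trajectory, "refunds are allowed within 30 days",
#                       allowed_tools=["search"], max_steps=5)
```

The specific heuristics don't matter much; in practice I'd swap the keyword check for an LLM judge that reads the system prompt and the full trajectory. The point is that each failure mode gets its own score instead of everything collapsing into a single pass/fail on the final answer.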
The results are night and day. Suddenly I can see hallucinations, constraint violations, inefficient paths, and consistency issues that traditional metrics completely missed.
Am I crazy? Or is the entire industry evaluating agents wrong?
I'd love to hear from others who are building agents. How are you evaluating them? What problems have you run into?