As I work on AI agents, I find myself constantly thinking about how to effectively test them.
As we integrate more knowledge sources and expand our agents' capabilities, testing becomes increasingly complex. As standard practice, we use evals to make sure quality holds up. But honestly, I feel like something is missing.
The issue I'm seeing is that we, as engineers, sometimes lack the domain knowledge to assess an agent's responses accurately. At the same time, current tooling makes it hard to collaborate with domain experts on testing; for example, it prioritizes metrics dashboards over the readability of the actual outputs.
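To make the complaint concrete, here is a minimal sketch of the kind of workflow I have in mind: alongside whatever metrics the eval harness computes, dump every case as a plain-text transcript a domain expert can read and annotate. The test cases, the run_agent call, and the output layout are all hypothetical stand-ins, not any particular tool's API.

    # Sketch: write each eval case as a human-readable transcript
    # for domain-expert review, instead of only feeding a dashboard.
    # CASES and run_agent() are hypothetical placeholders.
    from pathlib import Path

    CASES = [
        {"id": "case-001", "prompt": "Example question a domain expert would ask"},
    ]

    def run_agent(prompt: str) -> str:
        # Stand-in for the real agent call.
        return "(agent answer goes here)"

    def run_evals(out_dir: str = "review") -> None:
        Path(out_dir).mkdir(exist_ok=True)
        for case in CASES:
            answer = run_agent(case["prompt"])
            # One file per case, written for humans to read and mark up.
            transcript = (
                f"Case: {case['id']}\n\n"
                f"Prompt:\n{case['prompt']}\n\n"
                f"Agent answer:\n{answer}\n\n"
                "Expert verdict (fill in):\n"
                "Notes (fill in):\n"
            )
            Path(out_dir, f"{case['id']}.txt").write_text(transcript)

    if __name__ == "__main__":
        run_evals()

The point isn't the code itself, it's that the expert-facing artifact is the first-class output, with the aggregate metrics derived from their annotations afterwards.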
This has been my experience so far—I would love to hear your thoughts on this.
Comments
falcor84•1h ago
Your question is really abstract. Maybe give an explanation of your domain, the tooling you're currently using and its particular limitations?
Also worth saying that the priority given to metrics and dashboards over actual outcomes is a fundamental issue for any structured activity (see e.g. Goodhart's law and Campbell's Law), and has little to do with AI.