What do you think about measuring agentic AI in practice? A few weeks ago I read a post on Anthropic’s blog about evals for AI agents (https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents), and then yesterday I saw this one on Medium (https://medium.com/quantumblack/evaluations-for-the-agentic-world-c3c150f0dd5a). Feels like this is becoming a thing.
Anthropic talk about how to structure agent evals and what they’ve learned from running these internally. The QuantumBlack post looks more programmatic or lifecycle-focussed: how evals need to change once agents are combined and using tools, what to do when they're deployed, and how to factor evals in early.
Curious what people are doing in the real world. Are you rolling your own task suites? Using offline + online evals? Or mostly vibes and logs for now?