Ask HN: Agent evaluations, what is everything I should know?
3•akira_067•1h ago
I'm currently building coding agents and wondering what the standard approach is for creating and running evals. I gather that tasks and their definitions differ dramatically across domains and instances, so I'm not hoping for a one-size-fits-all answer. Just... what actually works for you in practice?
Comments
adastra22•36m ago
The capabilities of the tool matter more. Claude Code, Codex, and Cursor CLI all have different feature sets, and those usually determine the choice more than base model capabilities do.