Ask HN: Agent evaluations, what is everything I should know?
3•akira_067•2mo ago
I'm currently building coding agents and wondering what the standard approach is for creating and running evals. I gather that tasks and their definitions differ dramatically across domains and instances, so I'm not hoping for a one-size-fits-all answer. Just... what actually works for you in practice?
Comments
adastra22•2mo ago
The capabilities of the tool matter more. Claude Code, Codex, Cursor CLI all have different feature sets. This usually determines the choice more than base model capabilities.
esafak•2mo ago
They felt similar enough to me. One feature that did make a difference to me is when follow-up prompts are acted on: sooner or later.
adastra22•2mo ago
If all you are doing is chatting with the model, they're effectively the same. If you are actually building agentic workflows, they differ immensely in supported features. Codex doesn't support subagent personalities, for example, and until this week Claude Code didn't support generating structured responses.
akira_067•2mo ago
You think so? What features are differentiated enough to warrant that?
Seems like they all have tools for reading files, editing, running shell commands.
Cursor has linting access, but I think this can be added to Claude Code with hooks.
adastra22•2mo ago
Subagent personalities, skills, structured output generation, customizable context cleanup, etc.
akira_067•2mo ago
Most of those seem kind of useless... I probably do not know how to use them correctly, so what would you recommend by way of subagents and skills?