This kind of benchmark completely misses that nuance.
An agents file simply prevents hallucinations and steers the AI toward using existing files and APIs instead of inventing them. It can also include gold-standard tests and APIs as examples.
Before we had an agents file, it was constant hallucination chaos, and we had to correct the same mistakes over and over.
The good: It shows that on one kind of benchmark, some kinds of agentically-generated files don't help. So naively generating these files, for that kind of task, doesn't work. Useful to know!
The bad: Some people assume this means these files don't work in general, or that automation here doesn't work.
The truth: These files help measurably, and a bit of engineering lets you guarantee that for the typical case. As soon as you have an objective function, you can flip it into an eval and set an AI coder to editing these files until they pass. Ex: We recently released https://github.com/graphistry/graphistry-skills for more easily using graphistry via AI coding, and by having our authoring AI loop a bit against our evals, we jumped the scores from a 30-50% success rate to 90%+. As we encounter more scenarios (and mine them from our chats etc.), it's pretty straightforward to flip them into evals too and ask Claude/Codex to loop until those also pass.
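The eval-driven loop described above can be sketched in a few lines. This is a toy illustration, not the graphistry-skills implementation: `run_evals` and `revise_agents_file` are hypothetical stand-ins for whatever scoring harness and AI-editing step you actually use.

```python
# Toy sketch of an eval-driven authoring loop for an agents file.
# run_evals and revise_agents_file are hypothetical placeholders:
# in practice the first would run real scenario evals, and the
# second would ask an AI coder (Claude/Codex) to edit the file.

def run_evals(agents_file: str, scenarios: list[str]) -> float:
    # Stand-in scorer: fraction of scenarios the file covers at all.
    hits = sum(1 for s in scenarios if s in agents_file)
    return hits / len(scenarios)

def revise_agents_file(agents_file: str, scenarios: list[str]) -> str:
    # Stand-in for the AI edit: add guidance for one uncovered scenario.
    for s in scenarios:
        if s not in agents_file:
            return agents_file + f"\n- When doing {s}, use the existing API."
    return agents_file

def loop_until_pass(agents_file, scenarios, target=0.9, max_iters=10):
    # Edit-then-score until the eval target is hit or we give up.
    score = run_evals(agents_file, scenarios)
    for _ in range(max_iters):
        if score >= target:
            break
        agents_file = revise_agents_file(agents_file, scenarios)
        score = run_evals(agents_file, scenarios)
    return agents_file, score

doc, score = loop_until_pass("# AGENTS.md", ["plotting", "auth", "upload"])
print(round(score, 2))  # → 1.0 once every toy scenario is covered
```

The point is only the shape of the loop: once scoring is automated, "improve the agents file" becomes an optimization problem the coding agent can run against.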
verdverm•1h ago
AGENTS.md are extremely helpful if done well.
lucketone•59m ago