I had 30 broken Playwright tests and no way to tell which ones actually mattered. The problem wasn’t “fix the tests” — it was that there’s no coverage tool for test infrastructure trustworthiness. I had to build the ruler before I could measure anything.
So I wrote a file that defined a composite metric (four weighted components → one score), an improvement loop, and constraints. Pointed Claude at it. Went to bed. Woke up to 12 commits, 47 → 83.
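To make the "four weighted components → one score" idea concrete, here's a minimal sketch in the same bash + jq style as the actual scoring script. The component names, values, and weights are made up for illustration; they are not the real GOAL.md metric.

```shell
#!/usr/bin/env bash
# Hypothetical composite metric: four component scores (0-100) collapse
# into one scalar via a weighted sum. Names and weights are illustrative.
set -euo pipefail

# Per-component scores, as the scoring script might emit them.
cat > components.json <<'EOF'
{"flakiness": 60, "coverage": 40, "assertion_depth": 55, "ci_signal": 30}
EOF

# Weights sum to 1.0; the weighted sum is the single number the loop optimizes.
jq -r '
  (.flakiness * 0.4) + (.coverage * 0.3) +
  (.assertion_depth * 0.2) + (.ci_signal * 0.1)
  | round
' components.json
# prints 50
```

The point of collapsing to one scalar is that the agent's improvement loop needs a monotone target: "did this commit raise the number or not."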
The file became GOAL.md. The insight that surprised me: most software doesn’t have a natural scalar metric like val_bpb. You have to construct it. Documentation quality, API trustworthiness, test infrastructure confidence — these things have no pytest --cov equivalent. But once you build the ruler, the autoresearch loop works on them too.
The part I’m most uncertain about: the “dual score” pattern. When the agent is building its own measuring tools, it can game the metric by weakening the instrument. So the docs-quality example has two scores — one for the docs, one for the linter itself. The agent has to improve the telescope before it can use it. I think this is load-bearing but I’d love to hear if others have found different solutions to the same problem.
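The dual-score gate can be sketched in a few lines of bash + jq. This is my illustration of the pattern, not the repo's actual check; the file name, field names, and floor threshold are invented.

```shell
#!/usr/bin/env bash
# Hypothetical dual-score gate: the linter's own score must clear a floor
# before the docs score counts. Otherwise the headline number is capped at
# the instrument's score, so weakening the linter can't inflate the metric.
set -euo pipefail

cat > scores.json <<'EOF'
{"docs": 72, "linter": 45}
EOF

LINTER_FLOOR=60

docs=$(jq '.docs' scores.json)
linter=$(jq '.linter' scores.json)

if [ "$linter" -lt "$LINTER_FLOOR" ]; then
  # Weak instrument: report the instrument's score, not its reading.
  echo "score: $linter (linter below floor $LINTER_FLOOR; fix the telescope first)"
else
  echo "score: $docs"
fi
```

With these numbers the gate trips (45 < 60), so the only way the agent raises the headline score is by improving the linter first — which is the behavior the pattern is meant to force.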
Easiest way to try it: paste this into Claude Code, Cursor, or any coding agent and point it at one of your repos:
Read github.com/jmilinovich/goal-md — read the template and examples.
Then write me a GOAL.md for this repo and start working on it.
Happy to hear what breaks. The scoring script is bash + jq so it’s not exactly production-grade, and the examples are biased toward the kinds of projects I work on. More examples from different domains would make the pattern sharper.
jmilinovich•2h ago