Each skill is a structured SKILL.md. After every run, an LLM judge scores each skill and tags exact failures. An LLM patcher generates candidate fixes to just the failing section. Each candidate is replayed on past traces. Winner gets promoted. Loser discarded.
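A minimal sketch of that judge, patch, replay, promote loop. All names here are illustrative stand-ins, not the actual evoagents API; the toy judge below is a substring check standing in for an LLM call.

```python
# Hypothetical sketch of the replay-gated promote/discard step;
# names are illustrative, not the real evoagents internals.
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    section: str  # which SKILL.md section failed


def score(skill: str, traces: list, judge) -> float:
    """Fraction of past traces the skill passes under the judge."""
    verdicts = [judge(skill, t) for t in traces]
    return sum(v.passed for v in verdicts) / len(traces)


def gate(current: str, candidate: str, traces: list, judge) -> str:
    """Replay gate: promote the candidate patch only if it beats the
    current version on real past traces; otherwise discard it."""
    if score(candidate, traces, judge) > score(current, traces, judge):
        return candidate   # winner: promoted as the new version
    return current         # loser: discarded, current version stays


# Toy stand-in for the LLM judge: pass if the skill covers the trace's need.
def toy_judge(skill, trace):
    return Verdict(passed=trace in skill, section="instructions")


traces = ["cite sources", "primary sources"]
current = "Always cite sources."
candidate = "Always cite sources; prefer primary sources when available."
print(gate(current, candidate, traces, toy_judge))  # candidate wins replay
```

The point of the gate is that a patch never ships on the patcher's say-so alone: it has to outscore the incumbent on the same historical traces.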
One command: evoagents autofix
Key decisions:
- LLM-as-judge, not regex — constraints are natural language, evaluation should be too
- Section-level patching — only the broken part gets touched
- Replay gating — no patch ships without proving it improves on real data
- Versioned — every change is a new version, instant rollback
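The last decision, every change is a new version with instant rollback, could look roughly like this. This is a hypothetical sketch, not the actual evoagents storage layer:

```python
# Illustrative append-only version history for one SKILL.md;
# not the real evoagents implementation.
class SkillVersions:
    def __init__(self, initial: str):
        self.versions = [initial]  # version 0 is the original skill

    @property
    def current(self) -> str:
        return self.versions[-1]

    def promote(self, patched: str) -> int:
        """A patch that survives the replay gate becomes a new version;
        every earlier version is kept."""
        self.versions.append(patched)
        return len(self.versions) - 1  # new version number

    def rollback(self, to: int) -> str:
        """Instant rollback: re-promote an earlier version as-is."""
        self.versions.append(self.versions[to])
        return self.current


skill = SkillVersions("v0: draft instructions")
skill.promote("v1: patched instructions")
skill.rollback(0)
print(skill.current)  # back to the original text
```

Rollback here is itself just another promotion, so history stays append-only and nothing is ever overwritten.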
You can steer it: evoagents autofix --guide "prefer primary sources"
Install: pip install evoagents
Repo: https://github.com/jatingargiitk/evoagents