Two commands: /evo:discover figures out what to measure in your repo, instruments the eval, and runs a baseline. /evo:optimize runs the loop.
Evo is built on top of Karpathy's autoresearch (https://github.com/karpathy/autoresearch) with some structure bolted on. Karpathy's version is a greedy hill climb on a single branch. Evo runs tree search instead, so multiple directions can fork from any committed node. An orchestrator spawns N subagents in parallel, each in its own git worktree, each with its own iteration budget. Each subagent reads the failure traces of earlier attempts before forming its hypothesis, which is the interesting bit.
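To make the difference from a greedy hill climb concrete, here's a minimal sketch of that tree-search loop. All names (`Node`, `run_subagent`, `tree_search`) are illustrative, not Evo's actual API, and the subagent is simulated as a noisy score change rather than a real agent iterating in a git worktree:

```python
import random

# Hedged sketch of Evo-style tree search, not the real implementation.
# In a greedy hill climb, only the current tip can be extended; here any
# committed node stays in the frontier and can be forked again later.

class Node:
    def __init__(self, score, parent=None):
        self.score = score
        self.parent = parent

def run_subagent(parent, failure_traces, rng):
    """Simulate one subagent attempt. A real subagent would read the
    failure traces of earlier siblings before forming a hypothesis;
    here we just nudge the expected step up per observed failure."""
    step = rng.gauss(0.01 * len(failure_traces), 0.05)
    child = Node(parent.score + step, parent)
    trace = None if step > 0 else f"step {step:.3f} regressed the score"
    return child, trace

def tree_search(baseline, n_subagents=4, rounds=5, seed=0):
    rng = random.Random(seed)
    root = Node(baseline)
    frontier = [root]   # every committed node remains forkable
    best = root
    for _ in range(rounds):
        parent = max(frontier, key=lambda n: n.score)  # best-first pick
        traces = []
        for _ in range(n_subagents):  # parallel in Evo; serial here
            child, trace = run_subagent(parent, traces, rng)
            if trace is not None:
                traces.append(trace)      # later siblings see this failure
            else:
                frontier.append(child)    # commit only improvements
                best = max(best, child, key=lambda n: n.score)
    return best

best = tree_search(baseline=0.50)
print(f"baseline 0.500 -> best {best.score:.3f}")
```

The key structural difference is the `frontier` list: a greedy loop would replace the parent after each step, while here a node that spawned a bad batch of children can still be re-expanded from a later round.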
Next up is running Evo on real evals and benchmarks. Rollouts on frontier models add up fast, so if you're at a lab, cloud provider, or RL-env shop and can back an open-source benchmark run with GPU / API / env credits, please reach out (socials on GitHub). Results are public. Happy to shape the targets around domains you'd want highlighted.
Would also love feedback, especially from anyone who's built similar loops.