A real engineering experiment can run for hours. Along the way, the agent reads files, runs commands, checks logs, compares metrics, tries ideas that fail, and needs to remember what already happened. Once the context window fills up and earlier details fall out, it forgets the goal, loses track of the baseline, and retries ideas that already failed.
Remoroo is my attempt to solve that problem.
You point it at a repo and give it a measurable goal. It runs locally, tries changes, executes experiments, measures the result, keeps what helps, and throws away what does not.
A big part of the system is memory. Long runs generate far more context than a model can hold, so I built a demand-paging memory system inspired by OS virtual memory to keep the run coherent over time.
There is a technical writeup here: https://www.remoroo.com/blog/how-remoroo-works
Would love feedback from people working on long-running agents, training loops, eval harnesses, or similar workflows.