What sets it apart: Unlike benchmarks like SWE-Bench (which tests code generation on open-ended GitHub issues) or general agent evaluation suites (which mix diverse reasoning, coding, and interaction tasks), Tracecore focuses on deterministic episodes where agents must use constrained actions (e.g., file operations, ops triage) to achieve exact outcomes, with strict validation. It includes 15+ tasks across suites like operations and games, and supports running agents via adapters for frameworks like OpenClaw and Autogen, or via custom scripts.
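To make the "deterministic episode with strict validation" idea concrete, here is a minimal sketch of what such a task could look like. This is purely illustrative: the class name, fields, and validator below are my assumptions, not Tracecore's actual API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical shape of a deterministic episode (not Tracecore's real API)."""
    name: str
    allowed_actions: set[str]       # constrained action space the agent may use
    expected_state: dict[str, str]  # the exact outcome the agent must produce

    def validate(self, final_state: dict[str, str]) -> bool:
        # Strict validation: the final state must match the expected state exactly,
        # so success/failure is unambiguous and reproducible across runs.
        return final_state == self.expected_state

task = Task(
    name="rename-config",
    allowed_actions={"read_file", "write_file", "move_file"},
    expected_state={"config.yaml": "renamed"},
)
print(task.validate({"config.yaml": "renamed"}))   # exact match passes
print(task.validate({"config.yaml": "deleted"}))   # anything else fails
```

The key design point is that success is a pure function of the final state, which is what makes episodes deterministic and results comparable across agents and frameworks.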
You can try it out by installing through pip/uv, or by cloning the repo and installing the optional dev dependencies, then running the dashboard, the CLI wizard, or individual CLI commands. It outputs structured results: success/failure status, steps used, traces for analysis, diffs, bundles, and more.
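As a sketch of how those structured results might be consumed downstream, here is a small example that parses a result record and summarizes it. The field names (`task`, `success`, `steps_used`, `trace`) are assumptions for illustration, not Tracecore's actual output schema.

```python
import json

# Hypothetical result record; Tracecore's real schema may differ.
result_json = """
{
  "task": "rename-config",
  "success": true,
  "steps_used": 7,
  "trace": [
    {"step": 1, "action": "read_file", "arg": "config.yaml"},
    {"step": 2, "action": "move_file", "arg": "config.yaml"}
  ]
}
"""

result = json.loads(result_json)

# Structured output makes pass/fail aggregation and trace analysis trivial.
if result["success"]:
    print(f"{result['task']}: passed in {result['steps_used']} steps")
else:
    print(f"{result['task']}: failed after {result['steps_used']} steps")

for entry in result["trace"]:
    print(f"  step {entry['step']}: {entry['action']}({entry['arg']})")
```

Because results are machine-readable rather than free-form logs, they can be diffed across runs or fed into a dashboard without any scraping.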
I've been iterating on this over the past few weeks, adding new tasks and improving the harness. Previous discussions on AI eval tools were helpful in shaping the design. Feedback welcome, especially on expanding task suites or integration ideas.