I built this because my team ships an MCP server and we had no way to know if it actually made agents better. We tried running SWE-bench directly and through OpenHands, but both assume you're evaluating the agent itself instead of the tools you give it. We couldn't run the same task with and without our server in a controlled environment, and when things broke inside Docker we had no visibility into what went wrong. I wanted a framework that treats MCP server evaluation as a first-class problem.
Here's how mcpbr works at a high level. It orchestrates pre-built Docker images from Epoch AI, so environments are reproducible. It then runs the Claude Code CLI inside the container in headless mode. Finally, it scores results against one of 25+ supported benchmarks through an abstracted protocol, so a new benchmark can be added in ~100 lines of code. SWE-bench alone provides 2,294 test cases across real repos like Django, scikit-learn, and astropy.
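To make that abstraction concrete, here's a minimal sketch of what a benchmark adapter could look like: enumerate tasks (each tied to a Docker image and a prompt) and score whatever the agent leaves behind. The names and signatures below (`Task`, `Benchmark`, `load_tasks`, `evaluate`) are my illustration of the idea, not mcpbr's actual interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Task:
    task_id: str
    docker_image: str  # pre-built, reproducible environment for this task
    prompt: str        # instructions handed to the agent inside the container


class Benchmark(Protocol):
    """Hypothetical shape of a benchmark adapter (illustrative, not mcpbr's API)."""

    name: str

    def load_tasks(self, limit: int | None = None) -> list[Task]:
        """Enumerate tasks, each mapped to a Docker image and an agent prompt."""
        ...

    def evaluate(self, task: Task, workspace_dir: str) -> bool:
        """Score the agent's output, e.g. by running the task's own test suite."""
        ...
```

A split like this is presumably also what keeps the with/without-MCP comparison cheap: task loading and scoring stay identical, and only the agent's tool configuration changes between runs.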
Using mcpbr does come with a few trade-offs. It's currently Claude-focused, though other harnesses are in development. Evaluations are also fairly expensive ($50-200 for 25 tasks) and slow (2-4 hours for a full run). These aren't accidents; they're conscious decisions I felt were worth it for reproducible, controlled measurement with full logs and traces, where none of that existed before.
Try it:

```bash
pip install mcpbr && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
```
I'd love to hear which benchmarks matter most to you, and whether the A/B comparison format (MCP vs baseline) gives you the data you need.