I built this because my team ships an MCP server and we had no way to know if it actually made agents better. We tried running SWE-bench directly and through OpenHands, but both assume you're evaluating the agent itself instead of the tools you give it. We couldn't run the same task with and without our server in a controlled environment, and when things broke inside Docker we had no visibility into what went wrong. I wanted a framework that treats MCP server evaluation as a first-class problem.
Here's how mcpbr works at a high level. It orchestrates pre-built Docker images from Epoch AI, so environments are reproducible. It then runs the Claude Code CLI inside the container in headless mode. Finally, it scores results against one of 25+ supported benchmarks through an abstracted protocol, so a new benchmark can be added in ~100 lines of code. SWE-bench alone provides 2,294 test cases across real repos like Django, scikit-learn, and astropy.
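To make that abstraction concrete, here's a minimal sketch of what a benchmark adapter could look like: enumerate tasks (each tied to a Docker image and a prompt) and score whatever the agent leaves behind. The names and signatures below (`Task`, `Benchmark`, `load_tasks`, `evaluate`) are my illustration of the idea, not mcpbr's actual interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Task:
    task_id: str
    docker_image: str  # pre-built, reproducible environment for this task
    prompt: str        # instructions handed to the agent inside the container


class Benchmark(Protocol):
    """Hypothetical shape of a benchmark adapter (illustrative, not mcpbr's API)."""

    name: str

    def load_tasks(self, limit: int | None = None) -> list[Task]:
        """Enumerate tasks, each mapped to a Docker image and an agent prompt."""
        ...

    def evaluate(self, task: Task, workspace_dir: str) -> bool:
        """Score the agent's output, e.g. by running the task's own test suite."""
        ...
```

A split like this is presumably also what keeps the with/without-MCP comparison cheap: task loading and scoring stay identical, and only the agent's tool configuration changes between runs.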
Using mcpbr does come with a few trade-offs. It's currently Claude-focused, though other harnesses are in development. Evaluations are also fairly expensive ($50-200 for 25 tasks) and slow (2-4 hours for a full run). These aren't accidents; they're conscious decisions I felt were worth it for reproducible, controlled measurement with full logs and traces, where none of that existed before.
Try it:

```bash
pip install mcpbr && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
```
I'd love to hear which benchmarks matter most to you, and whether the A/B comparison format (MCP vs baseline) gives you the data you need.