
Show HN: Mcpbr – does your MCP help? Test it on SWE-bench and 25 evals

https://github.com/greynewell/mcpbr
2•greynewell•1h ago
mcpbr runs a Claude Code agent on real-world tasks (once with your MCP server, once without) in identical Docker containers and records hard numbers that tell you whether your tools actually helped.
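
To make the with/without comparison concrete, here is a minimal sketch of how two arms of a run could be summarized. It is not mcpbr's code: `TaskResult`, `resolve_rate`, and `compare` are hypothetical names, and the sample results are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one task in one arm (hypothetical record, not mcpbr's schema)."""
    task_id: str
    resolved: bool


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


def compare(with_mcp: list[TaskResult], baseline: list[TaskResult]) -> dict[str, float]:
    """Compare both arms, restricted to the task IDs they have in common."""
    shared = {r.task_id for r in with_mcp} & {r.task_id for r in baseline}
    mcp = [r for r in with_mcp if r.task_id in shared]
    base = [r for r in baseline if r.task_id in shared]
    mcp_rate, base_rate = resolve_rate(mcp), resolve_rate(base)
    return {"mcp": mcp_rate, "baseline": base_rate, "delta": mcp_rate - base_rate}


# Made-up results for four tasks, one list per arm.
with_mcp = [TaskResult("t1", True), TaskResult("t2", True),
            TaskResult("t3", False), TaskResult("t4", True)]
baseline = [TaskResult("t1", True), TaskResult("t2", False),
            TaskResult("t3", False), TaskResult("t4", True)]

summary = compare(with_mcp, baseline)
print(summary)
```

Restricting the comparison to shared task IDs keeps the two rates comparable if one arm crashed on a task the other completed.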

I built this because my team ships an MCP server and we had no way to know if it actually made agents better. We tried running SWE-bench directly and through OpenHands, but both assume you're evaluating the agent itself instead of the tools you give it. We couldn't run the same task with and without our server in a controlled environment, and when things broke inside Docker we had no visibility into what went wrong. I wanted a framework that treats MCP server evaluation as a first-class problem.

Here's how mcpbr works at a high level. It orchestrates pre-built Docker images from Epoch AI so environments are reproducible. Then it runs Claude Code CLI inside the container in headless mode. Finally, it evaluates one of 25+ benchmarks through an abstracted protocol, allowing a new benchmark to be added with ~100 lines of code. SWE-bench alone provides 2,294 test cases across real repos like Django, scikit-learn, and astropy.
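
As a rough illustration of what "an abstracted protocol" for benchmarks might look like, here is a sketch in Python. Everything in it is an assumption made for the example: the `Benchmark` interface, its method names, the toy `EchoBenchmark`, and the `fake_agent` stand-in are hypothetical and may not resemble mcpbr's real interface.

```python
from typing import Callable, Protocol


class Benchmark(Protocol):
    """Hypothetical interface; mcpbr's real protocol may look different."""
    name: str

    def load_tasks(self, limit: int) -> list[dict]: ...
    def build_prompt(self, task: dict) -> str: ...
    def evaluate(self, task: dict, agent_output: str) -> bool: ...


class EchoBenchmark:
    """Toy benchmark: a task passes if the output contains the expected string."""
    name = "echo"

    def load_tasks(self, limit: int) -> list[dict]:
        tasks = [{"id": f"echo-{i}", "expected": f"token-{i}"} for i in range(3)]
        return tasks[:limit]

    def build_prompt(self, task: dict) -> str:
        return f"Reply with the string {task['expected']}."

    def evaluate(self, task: dict, agent_output: str) -> bool:
        return task["expected"] in agent_output


def run(benchmark: Benchmark, agent: Callable[[str], str], limit: int) -> dict[str, bool]:
    """Send each task's prompt to `agent` and score the output."""
    results = {}
    for task in benchmark.load_tasks(limit):
        output = agent(benchmark.build_prompt(task))
        results[task["id"]] = benchmark.evaluate(task, output)
    return results


def fake_agent(prompt: str) -> str:
    # Stand-in for a real agent: echoes the prompt, so every toy task passes.
    return prompt


scores = run(EchoBenchmark(), fake_agent, limit=2)
print(scores)
```

With an interface like this, the runner only depends on three methods, which is one way a new benchmark could stay small.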

Using mcpbr does come with a few trade-offs. It's currently Claude-focused, though other harnesses are in development. Evaluations are also kinda expensive ($50-200 for 25 tasks). Finally, it's a bit slow (2-4 hours for a full run). These are not accidents but conscious decisions: I felt the cost was worth it to get reproducible, controlled measurement, with full logs and traces, where none existed before.
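
For a sense of scale, the quoted cost range works out to a per-task figure with simple arithmetic. The projection to all 2,294 SWE-bench cases below assumes cost scales linearly with task count, which is my assumption for the sketch and not a claim from the project.

```python
# Figures quoted above: $50-200 for 25 tasks.
low_usd, high_usd, tasks = 50, 200, 25

per_task = (low_usd / tasks, high_usd / tasks)
print(per_task)  # (2.0, 8.0) dollars per task

# Assumption: cost grows linearly with the number of tasks.
swe_bench_cases = 2294
projected = tuple(cost * swe_bench_cases for cost in per_task)
print(projected)  # (4588.0, 18352.0) dollars for the full set
```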

Try it:

```bash
pip install mcpbr && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
```

I'd love to hear which benchmarks matter most to you, and whether the A/B comparison format (MCP vs baseline) gives you the data you need.

I put AoE II sounds in my Claude Code Worktree/Sandbox Manager and it's glorious

https://www.agent-of-empires.com/docs/sounds.html
2•river_otter•2m ago•1 comment

Scaling markets with non-human operators

https://selectfromwhereand.com/musings/scaling_operators/
2•iamsam123•9m ago•1 comment

Show HN: Wikipedia as a doomscrollable social media feed

https://xikipedia.org
3•rebane2001•11m ago•0 comments

Artemis II: A Step Towards Permanent Human Activity Beyond Low Earth Orbit

https://www.realcleardefense.com/articles/2026/01/31/artemis_ii_a_step_towards_permanent_human_ac...
1•Gaishan•13m ago•0 comments

Oracle to Raise Up to $50B This Year for Cloud Investment

https://www.bloomberg.com/news/articles/2026-02-01/oracle-to-raise-up-to-50-billion-this-year-for...
2•zerosizedweasle•15m ago•2 comments

The Physics of Glitches: Analyzing 'The Backrooms' as a Systems Failure

https://misssandwich.substack.com/p/the-yellow-perversion-of-the-real-eed
1•misssandwich•15m ago•1 comment

We built an AI sysadmin that works (and won't delete /usr)

https://github.com/goshops-com/opsagent
2•sjcotto•22m ago•1 comment

Time Machine-style Backups with rsync (2018)

https://samuelhewitt.com/blog/2018-06-05-time-machine-style-backups-with-rsync
1•accrual•24m ago•0 comments

VoidLink: The Cloud-Native Malware Framework Weaponizing Linux Infrastructure

https://blog.checkpoint.com/research/voidlink-the-cloud-native-malware-framework-weaponizing-linu...
1•PaulHoule•26m ago•0 comments

Testing your fit for policy careers (2024)

https://emergingtechpolicy.org/essentials/policy-fit-testing/
2•jstrieb•26m ago•0 comments

It's All About the Pixel Economy

https://cvalenzuelab.com/pixel-economy
1•nsm•26m ago•0 comments

Before ChatGPT-HW debate there were other "If students use X to do HW" debates

https://blog.computationalcomplexity.org/2026/02/before-chatgpt-hw-debate-there-were.html
1•zdw•27m ago•0 comments

Selfhosted Bible PWA

https://mobilebible.net/
2•PaxSubChristo•28m ago•3 comments

Otava: Change Detection for Continuous Performance Engineering

https://github.com/apache/otava
1•tanelpoder•28m ago•0 comments

History and Timeline of the Proco Rat Pedal (2021)

https://web.archive.org/web/20211030011207/https://thejhsshow.com/articles/history-and-timeline-o...
2•brudgers•30m ago•1 comment

Show HN: I made a voice cloning Discord bot

https://copykitten.gg/
1•TheSaltySeaCow•32m ago•0 comments

Two kinds of AI users are emerging. The gap between them is astonishing

https://martinalderson.com/posts/two-kinds-of-ai-users-are-emerging/
1•martinald•38m ago•0 comments

How One Line of Python Triggers 12,000 Lines of Code [video]

https://www.youtube.com/watch?v=5B6W2OGfxq0
1•thunderbong•49m ago•0 comments

Show HN: Cut Your Pinecone Bill by 50% (Open Source Cost Auditor)

https://github.com/billycph/VectorDBCostSavingInspector
1•billycph•49m ago•0 comments

Aliasing and the Heisenberg Uncertainty Principle

http://blog.sigfpe.com/2013/01/aliasing-and-heisenberg-uncertainty.html
2•wtrm•50m ago•0 comments

Automatic Epstein file downloader [video]

https://www.youtube.com/watch?v=D0TX1zGOO9U
4•xecaz•52m ago•2 comments

Your Deepest Value Is Adaption

https://www.overcomingbias.com/p/your-deepest-value-is-adaption
1•jger15•55m ago•0 comments

Kanjideck: The full walkthrough from zero to launch

https://alt-romes.github.io/posts/2026-01-30-from-side-project-to-kickstarter-a-walkthrough.html
4•romes•55m ago•0 comments

A heterogeneous population code at the first synapse of vision

https://www.nature.com/articles/s41467-026-68757-x
2•bookofjoe•1h ago•0 comments

Show HN: Dungeon-1, a Zork-style text adventure built with constrained LLMs

https://dungeonminusone.com/login.html
1•jwproj•1h ago•0 comments

Zombie (Album, 1976)

https://en.wikipedia.org/wiki/Zombie_(album)
2•defrost•1h ago•0 comments

We (As a Society) Peaked in the 90s

https://chris.pagecord.com/we-as-a-society-peaked-in-the-90s
34•stog•1h ago•43 comments

Show HN: Specmark – annotate Markdown for AI feedback

https://specmark.dev/
1•jlbrooks•1h ago•0 comments

Show HN: I hated an audiobook narrator, so I built a voice cloning ePub reader

https://github.com/jarodise/ClonEpub-Pocket
1•jarodise•1h ago•0 comments

Decomp Dev

https://decomp.dev/projects
1•aizk•1h ago•1 comment