The MCP spec doesn't have tool versioning yet, and there's no static artifact describing what a server exposes. tools/list just returns whatever's in memory at runtime, so there's nothing to commit or diff against, and changes can slip through that break downstream workflows without anyone noticing.
VCR.py solved this problem for HTTP a long time ago, and I realized the same pattern works here. mcp-recorder captures the full MCP interaction sequence — initialize, tools/list, tools/call — into a JSON cassette file. Because it records complete protocol exchanges rather than just schema snapshots, you're testing actual behavior: if a tool call that used to return a specific format now returns something different, or a capability quietly disappears during the handshake, the cassette catches it. From that single recording you can either replay it as a mock server (no API keys, fully deterministic) or verify your changed server against it and catch any diff:
Verifying golden.json against node dist/index.js
1. initialize [PASS]
2. tools/list [PASS]
3. tools/call [search] [FAIL]
$.result.content[0].text: "old output" != "new output"
4. tools/call [analyze] [PASS]
Result: 3/4 passed, 1 failed

Non-zero exit code on any mismatch, so it plugs straight into CI.
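Conceptually, the verify step replays each recorded request against the live server and diffs the JSON responses. Here's a minimal sketch of that comparison logic — not mcp-recorder's actual implementation, just an illustration of how a recorded JSON-RPC result can be diffed against a live one with JSONPath-style mismatch locations:

```python
import json

def diff_json(recorded, live, path="$"):
    """Recursively compare a recorded JSON value with a live one,
    yielding JSONPath-style locations where they disagree."""
    if isinstance(recorded, dict) and isinstance(live, dict):
        for key in recorded.keys() | live.keys():
            yield from diff_json(recorded.get(key), live.get(key), f"{path}.{key}")
    elif isinstance(recorded, list) and isinstance(live, list):
        for i, (r, l) in enumerate(zip(recorded, live)):
            yield from diff_json(r, l, f"{path}[{i}]")
        if len(recorded) != len(live):
            yield f"{path}: length {len(recorded)} != {len(live)}"
    elif recorded != live:
        yield f"{path}: {json.dumps(recorded)} != {json.dumps(live)}"

# A tool call whose text content changed between recording and now:
recorded = {"result": {"content": [{"type": "text", "text": "old output"}]}}
live = {"result": {"content": [{"type": "text", "text": "new output"}]}}
for mismatch in diff_json(recorded, live):
    print(mismatch)  # $.result.content[0].text: "old output" != "new output"
```

The same recursive walk covers handshake capabilities and tool schemas too, which is why a capability that silently disappears shows up as a concrete path in the diff.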
You can try it right now with minimal setup; a public demo server and a scenarios file are included:
pip install mcp-recorder
mcp-recorder record-scenarios scenarios.yml
mcp-recorder verify --cassette cassettes/demo_walkthrough.json \
  --target https://mcp.devhelm.io
It works with both HTTP and stdio transports. Scenarios are defined in YAML, so it can test MCP servers written in any language, and there's a pytest plugin if you want tighter integration. Secret redaction and environment variable interpolation are built in.
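Cassettes end up committed to the repo, so anything sensitive has to be scrubbed before they're written. A rough sketch of the two mechanisms — interpolating env vars into requests on the way in, and redacting known secret values on the way out. The function names and `${VAR}` placeholder syntax here are my own illustration, not necessarily mcp-recorder's actual conventions:

```python
import os
import re

def interpolate_env(text):
    """Replace ${VAR} placeholders with environment values before sending."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)

def redact_secrets(obj, secrets, placeholder="<REDACTED>"):
    """Recursively replace known secret values before the cassette is written."""
    if isinstance(obj, dict):
        return {k: redact_secrets(v, secrets, placeholder) for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact_secrets(v, secrets, placeholder) for v in obj]
    if isinstance(obj, str):
        for secret in secrets:
            obj = obj.replace(secret, placeholder)
        return obj
    return obj

os.environ["API_KEY"] = "sk-live-123"  # hypothetical key for the example
request = {"headers": {"Authorization": f"Bearer {interpolate_env('${API_KEY}')}"}}
safe = redact_secrets(request, secrets=[os.environ["API_KEY"]])
print(safe)  # {'headers': {'Authorization': 'Bearer <REDACTED>'}}
```

The key property is that the real key flows through the live request but never reaches disk, so replays stay deterministic and safe to share.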
To make sure this actually works on real codebases, I submitted PRs to production MCP servers: monday.com's MCP server (https://github.com/mondaycom/mcp/pull/222), Tavily's MCP server (https://github.com/tavily-ai/tavily-mcp/pull/113), and Firecrawl's MCP server (https://github.com/firecrawl/firecrawl-mcp-server/pull/175). Each went from zero schema coverage to full tool-surface verification, with a clean schema diff available on every tool change. One big benefit is that verification and replay need no API keys: responses are deterministic, and no live requests hit real servers.
I wrote up a deeper dive into the schema drift problem and the VCR pattern for MCP here: https://devhelm.io/blog/regression-testing-mcp-servers
mcp-recorder is MIT-licensed and on PyPI. Source is at https://github.com/devhelmhq/mcp-recorder — issues and PRs are welcome.
I'm building more tooling around MCP and agent reliability, so if you're dealing with similar problems, I'd genuinely like to hear what's been painful for you.