The MCP spec doesn't have tool versioning yet, and there's no static artifact describing what a server exposes. tools/list just returns whatever's in memory at runtime, so there's nothing to commit or diff against, and changes can slip through that break downstream workflows without anyone noticing.
VCR.py solved this problem for HTTP a long time ago, and I realized the same pattern works here. mcp-recorder captures the full MCP interaction sequence — initialize, tools/list, tools/call — into a JSON cassette file. Because it records complete protocol exchanges rather than just schema snapshots, you're testing actual behavior: if a tool call that used to return a specific format now returns something different, or a capability quietly disappears during the handshake, the cassette catches it. From that single recording you can either replay it as a mock server (no API keys, fully deterministic) or verify your changed server against it and catch any diff:
Verifying golden.json against node dist/index.js
1. initialize [PASS]
2. tools/list [PASS]
3. tools/call [search] [FAIL]
$.result.content[0].text: "old output" != "new output"
4. tools/call [analyze] [PASS]
Result: 3/4 passed, 1 failed

Non-zero exit code on any mismatch, so it plugs straight into CI.
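Conceptually, the verify step replays each recorded request against the live server and diffs the JSON responses. Here's a minimal sketch of that comparison logic — not mcp-recorder's actual implementation, just an illustration of how a recorded JSON-RPC result can be diffed against a live one with JSONPath-style mismatch locations:

```python
import json

def diff_json(recorded, live, path="$"):
    """Recursively compare a recorded JSON value with a live one,
    yielding JSONPath-style locations where they disagree."""
    if isinstance(recorded, dict) and isinstance(live, dict):
        for key in recorded.keys() | live.keys():
            yield from diff_json(recorded.get(key), live.get(key), f"{path}.{key}")
    elif isinstance(recorded, list) and isinstance(live, list):
        for i, (r, l) in enumerate(zip(recorded, live)):
            yield from diff_json(r, l, f"{path}[{i}]")
        if len(recorded) != len(live):
            yield f"{path}: length {len(recorded)} != {len(live)}"
    elif recorded != live:
        yield f"{path}: {json.dumps(recorded)} != {json.dumps(live)}"

# A tool call whose text content changed between recording and now:
recorded = {"result": {"content": [{"type": "text", "text": "old output"}]}}
live = {"result": {"content": [{"type": "text", "text": "new output"}]}}
for mismatch in diff_json(recorded, live):
    print(mismatch)  # $.result.content[0].text: "old output" != "new output"
```

The same recursive walk covers handshake capabilities and tool schemas too, which is why a capability that silently disappears shows up as a concrete path in the diff.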
You can try it right now with minimal setup; a public demo server and a scenarios file are included:
pip install mcp-recorder
mcp-recorder record-scenarios scenarios.yml
mcp-recorder verify --cassette cassettes/demo_walkthrough.json \
  --target https://mcp.devhelm.io
It works with both HTTP and stdio transports. Scenarios are defined in YAML, so it can test MCP servers written in any language, and there's a pytest plugin if you want tighter integration. Secret redaction and environment variable interpolation are built in.
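Cassettes end up committed to the repo, so anything sensitive has to be scrubbed before they're written. A rough sketch of the two mechanisms — interpolating env vars into requests on the way in, and redacting known secret values on the way out. The function names and `${VAR}` placeholder syntax here are my own illustration, not necessarily mcp-recorder's actual conventions:

```python
import os
import re

def interpolate_env(text):
    """Replace ${VAR} placeholders with environment values before sending."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)

def redact_secrets(obj, secrets, placeholder="<REDACTED>"):
    """Recursively replace known secret values before the cassette is written."""
    if isinstance(obj, dict):
        return {k: redact_secrets(v, secrets, placeholder) for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact_secrets(v, secrets, placeholder) for v in obj]
    if isinstance(obj, str):
        for secret in secrets:
            obj = obj.replace(secret, placeholder)
        return obj
    return obj

os.environ["API_KEY"] = "sk-live-123"  # hypothetical key for the example
request = {"headers": {"Authorization": f"Bearer {interpolate_env('${API_KEY}')}"}}
safe = redact_secrets(request, secrets=[os.environ["API_KEY"]])
print(safe)  # {'headers': {'Authorization': 'Bearer <REDACTED>'}}
```

The key property is that the real key flows through the live request but never reaches disk, so replays stay deterministic and safe to share.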
To make sure this actually works on real codebases, I submitted PRs to production MCP servers: monday.com's MCP server (https://github.com/mondaycom/mcp/pull/222), Tavily's MCP server (https://github.com/tavily-ai/tavily-mcp/pull/113), and Firecrawl's MCP server (https://github.com/firecrawl/firecrawl-mcp-server/pull/175). Each went from zero schema coverage to full tool-surface verification, with a clean schema diff available on every tool change. One big benefit is that verification and replay need no API keys: responses are deterministic, and no live requests hit real servers.
I wrote up a deeper dive into the schema drift problem and the VCR pattern for MCP here: https://devhelm.io/blog/regression-testing-mcp-servers
mcp-recorder is MIT-licensed and on PyPI. Source is at https://github.com/devhelmhq/mcp-recorder — issues and PRs are welcome.
I'm building more tooling around MCP and agent reliability, so if you're dealing with similar problems, I'd genuinely like to hear what's been painful for you.