When I transferred 237 of those corrections as rules to a new agent, to save onboarding time in a new repo, it made 44 new mistakes. 13 were in categories the rules explicitly covered. The rules were present in context. The behavior wasn't there. I published the field study with full correction logs.
Then Meta's Superintelligence Labs published HyperAgents (arXiv:2603.19461, March 2026). They found the complementary result: improvements DO transfer across domains when embodied in executable mechanisms (persistent memory, performance tracking, eval loops), not when written as rule text. Two independent studies, same boundary: documentation is not behavior.
So I built Calx. pip install getcalx gives you a CLI + MCP server that:
- Captures the corrections developers make to AI agents
- Detects recurrence via keyword similarity (Jaccard; toy sketch below)
- Auto-promotes corrections that recur past a 3x threshold into enforced rules and hooks, injected at session start
- Scopes rules per domain/directory so each agent gets only what's relevant
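The core loop is small enough to sketch inline. This is a toy version, not the shipped code; the tokenizer, cutoff, and in-memory log here are illustrative stand-ins:

```python
import re

PROMOTE_THRESHOLD = 3    # auto-promote once a correction recurs 3x
SIMILARITY_CUTOFF = 0.5  # illustrative cutoff for "same correction"

def keywords(text: str) -> set[str]:
    """Reduce a correction to a keyword set (toy tokenizer)."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a or b else 0.0

def record_correction(log: list[dict], text: str) -> dict | None:
    """Log a correction; return a promotable rule once it recurs enough."""
    kw = keywords(text)
    matches = sum(1 for c in log if jaccard(kw, c["keywords"]) >= SIMILARITY_CUTOFF)
    log.append({"text": text, "keywords": kw})
    if matches + 1 >= PROMOTE_THRESHOLD:
        return {"rule": text, "occurrences": matches + 1}
    return None
```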
It runs as a FastMCP server over Streamable HTTP (SQLite for local storage), so any MCP-compatible client can connect: Claude Code, Claude Desktop, Cursor, or custom agents. It's primarily designed for Claude Code. Beyond correction capture, it also handles token discipline (preventing context compaction from destroying the correction signal), multi-agent orchestration, session lifecycle hooks, orientation gates, and dirty-exit recovery.
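If you haven't used FastMCP, the wiring is roughly this; the tool names below are hypothetical sketches, not Calx's actual tool surface:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("calx-sketch")
_corrections: list[dict] = []  # stand-in for the local SQLite store

@mcp.tool()
def log_correction(text: str, domain: str) -> str:
    """Record a correction scoped to a domain/directory (hypothetical tool)."""
    _corrections.append({"text": text, "domain": domain})
    return f"recorded ({len(_corrections)} total)"

@mcp.tool()
def rules_for(domain: str) -> list[str]:
    """Return corrections relevant to one domain (hypothetical tool)."""
    return [c["text"] for c in _corrections if c["domain"] == domain]

if __name__ == "__main__":
    # Streamable HTTP is what lets Claude Code, Cursor, etc. connect over a URL.
    mcp.run(transport="streamable-http")
```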
The difference from agent memory tools: existing systems store information for retrieval. Calx tracks the behavioral plane: how an agent works with a specific person, not just what it knows. The data shows that the information plane alone doesn't reliably change behavior.
v0.5.0, 443 tests, MIT license. Paper with full evidence: https://doi.org/10.5281/zenodo.19159223
spenceships•1h ago
The origin was accidental. I was building a startup (AI career translation platform), not running an experiment. The correction logs were just how I managed the agents.
When the transfer failed, it honestly didn't occur to me until well after that I had measured anything at all. I was pivoting the platform to go fully agentic and had burned through something like 1.9B tokens in 4 days, so I ran an audit to see what had fallen through the cracks. The audit was when I began to realize what I had found. At that point the paper just made sense, because I hadn't seen anyone else talking about it.
What surprised me the most: architectural corrections (changing how something is structured) had zero recurrence. Process corrections ("always do X before Y") had roughly 50% persistence, with recurring failure chains. One correction chain went eight entries deep, each referencing the previous ones. The agent kept making the same category of mistake with slight variations.
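To make "chain" concrete: each entry points back at the correction it supersedes. The field names here are mine for illustration, not the stored schema:

```python
from dataclasses import dataclass

@dataclass
class Correction:
    """Illustrative record shape, not the actual SQLite schema."""
    id: int
    category: str           # e.g. "process" or "architectural"
    text: str               # the correction as given to the agent
    parent_id: int | None   # prior entry in the chain; None if first

# An eight-deep chain is eight records, each referencing the previous:
chain = [
    Correction(i, "process", f"variation {i} of the same fix",
               parent_id=i - 1 if i > 0 else None)
    for i in range(8)
]
```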
HyperAgents landing the same week I was writing this up was genuinely lucky timing, and I didn't find out about it until last week. In my opinion, their imp@50 = 0.630 on math (where traditional transfer scored 0.0) is the clearest evidence that the mechanism vs documentation distinction is real and measurable.
What I'd love feedback on:
1. Is the MCP server the right distribution mechanism, or do people want this as IDE plugins? I've always strongly believed in meeting people where they are with open source, but I'm curious what this community thinks.
2. The recurrence detection uses Jaccard similarity on keyword sets. This is simple and works for my data, but I suspect it breaks on large teams. Anyone have experience with correction clustering at scale? (One direction is sketched after this list.)
3. The paper methodology is N=1. HyperAgents converged on the same boundary, but that doesn't account for everything, and I know the limitations. If anyone wants to replicate with their own correction logs, the framework is designed for it and I'd actively help. I'm quite eager to have people mess around with the tool and tell me what they think.
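On question 2: the current check is plain pairwise Jaccard. One standard way to keep that sub-quadratic at team scale would be MinHash LSH; here's a sketch with the datasketch library (not in Calx, just the idea I've been weighing):

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(keywords: set[str], num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from a correction's keyword set."""
    m = MinHash(num_perm=num_perm)
    for kw in keywords:
        m.update(kw.encode("utf8"))
    return m

# The index returns candidates with (approximately) Jaccard >= threshold,
# without comparing every pair of corrections.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("corr-1", minhash({"always", "run", "tests", "before", "commit"}))
lsh.insert("corr-2", minhash({"never", "force", "push", "main"}))

candidates = lsh.query(minhash({"run", "tests", "before", "every", "commit"}))
print(candidates)  # likely ["corr-1"]
```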
As a note, I'm still shipping the hook and orchestration methodology that works with the MCP server; at the time of writing I'm about a third of the way through the build, and I'm hoping to have it live and packaged by morning EST.
Happy to answer questions about the correction dynamics, the MCP architecture, or anything in the paper.