When I transferred 237 of those corrections as rules to a new agent, to save onboarding time in a new repo, it made 44 new mistakes. 13 were in categories the rules explicitly covered. The rules were present in context. The behavior wasn't there. I published the field study with full correction logs.
Then Meta's Superintelligence Labs published HyperAgents (arXiv:2603.19461, March 2026). They found the complementary result: improvements DO transfer across domains when embodied in executable mechanisms (persistent memory, performance tracking, eval loops), not when written as rule text. Two independent studies, same boundary: documentation is not behavior.
So I built Calx. pip install getcalx gives you a CLI + MCP server that:
- Captures the corrections developers make to AI agents
- Detects recurrence via keyword similarity (Jaccard; toy sketch below)
- Auto-promotes corrections that recur past a 3x threshold into enforced rules and hooks, injected at session start
- Scopes rules per domain/directory so each agent gets only what's relevant
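The core loop is small enough to sketch inline. This is a toy version, not the shipped code; the tokenizer, cutoff, and in-memory log here are illustrative stand-ins:

```python
import re

PROMOTE_THRESHOLD = 3    # auto-promote once a correction recurs 3x
SIMILARITY_CUTOFF = 0.5  # illustrative cutoff for "same correction"

def keywords(text: str) -> set[str]:
    """Reduce a correction to a keyword set (toy tokenizer)."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a or b else 0.0

def record_correction(log: list[dict], text: str) -> dict | None:
    """Log a correction; return a promotable rule once it recurs enough."""
    kw = keywords(text)
    matches = sum(1 for c in log if jaccard(kw, c["keywords"]) >= SIMILARITY_CUTOFF)
    log.append({"text": text, "keywords": kw})
    if matches + 1 >= PROMOTE_THRESHOLD:
        return {"rule": text, "occurrences": matches + 1}
    return None
```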
It runs as a FastMCP server over Streamable HTTP (SQLite for local storage), so any MCP-compatible client can connect: Claude Code, Claude Desktop, Cursor, or custom agents. It's primarily designed for Claude Code. Beyond correction capture, it also handles token discipline (preventing context compaction from destroying the correction signal), multi-agent orchestration, session lifecycle hooks, orientation gates, and dirty-exit recovery.
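If you haven't used FastMCP, the wiring is roughly this; the tool names below are hypothetical sketches, not Calx's actual tool surface:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("calx-sketch")
_corrections: list[dict] = []  # stand-in for the local SQLite store

@mcp.tool()
def log_correction(text: str, domain: str) -> str:
    """Record a correction scoped to a domain/directory (hypothetical tool)."""
    _corrections.append({"text": text, "domain": domain})
    return f"recorded ({len(_corrections)} total)"

@mcp.tool()
def rules_for(domain: str) -> list[str]:
    """Return corrections relevant to one domain (hypothetical tool)."""
    return [c["text"] for c in _corrections if c["domain"] == domain]

if __name__ == "__main__":
    # Streamable HTTP is what lets Claude Code, Cursor, etc. connect over a URL.
    mcp.run(transport="streamable-http")
```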
The difference from agent memory tools: existing systems store information for retrieval. Calx tracks the behavioral plane: how an agent works with a specific person, not just what it knows. The data shows that the information plane alone doesn't reliably change behavior.
v0.5.0, 443 tests, MIT license. Paper with full evidence: https://doi.org/10.5281/zenodo.19159223
spenceships•1h ago
The origin was accidental. I was building a startup (AI career translation platform), not running an experiment. The correction logs were just how I managed the agents.
When the transfer failed, it honestly didn't occur to me until well after that I had measured anything at all. I was pivoting the platform to go fully agentic and had burned through something like 1.9B tokens in 4 days, so I ran an audit to see what had fallen through the cracks. The audit was when I began to realize what I had found. At that point the paper just made sense, because I hadn't seen anyone else talking about it.
What surprised me the most: architectural corrections (changing how something is structured) had zero recurrence. Process corrections ("always do X before Y") had roughly 50% persistence, with recurring failure chains. One correction chain went eight entries deep, each referencing the previous ones. The agent kept making the same category of mistake with slight variations.
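To make "chain" concrete: each entry points back at the correction it supersedes. The field names here are mine for illustration, not the stored schema:

```python
from dataclasses import dataclass

@dataclass
class Correction:
    """Illustrative record shape, not the actual SQLite schema."""
    id: int
    category: str           # e.g. "process" or "architectural"
    text: str               # the correction as given to the agent
    parent_id: int | None   # prior entry in the chain; None if first

# An eight-deep chain is eight records, each referencing the previous:
chain = [
    Correction(i, "process", f"variation {i} of the same fix",
               parent_id=i - 1 if i > 0 else None)
    for i in range(8)
]
```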
HyperAgents landing the same week I was writing this up was genuinely lucky timing, and I didn't find out about it until last week. In my opinion, their imp@50 = 0.630 on math (where traditional transfer scored 0.0) is the clearest evidence that the mechanism vs documentation distinction is real and measurable.
What I'd love feedback on:
1. Is the MCP server the right distribution mechanism, or do people want this as IDE plugins? I've always strongly believed in meeting people where they are with open source, but I'm curious what this community thinks.
2. The recurrence detection uses Jaccard similarity on keyword sets. This is simple and works for my data, but I suspect it breaks on large teams. Anyone have experience with correction clustering at scale? (One direction is sketched after this list.)
3. The paper methodology is N=1. HyperAgents converged on the same boundary, but that doesn't account for everything, and I know the limitations. If anyone wants to replicate with their own correction logs, the framework is designed for it and I'd actively help. I'm quite eager to have people mess around with the tool and tell me what they think.
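On question 2: the current check is plain pairwise Jaccard. One standard way to keep that sub-quadratic at team scale would be MinHash LSH; here's a sketch with the datasketch library (not in Calx, just the idea I've been weighing):

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(keywords: set[str], num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from a correction's keyword set."""
    m = MinHash(num_perm=num_perm)
    for kw in keywords:
        m.update(kw.encode("utf8"))
    return m

# The index returns candidates with (approximately) Jaccard >= threshold,
# without comparing every pair of corrections.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("corr-1", minhash({"always", "run", "tests", "before", "commit"}))
lsh.insert("corr-2", minhash({"never", "force", "push", "main"}))

candidates = lsh.query(minhash({"run", "tests", "before", "every", "commit"}))
print(candidates)  # likely ["corr-1"]
```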
As a note, I'm still shipping the hook and orchestration methodology that works with the MCP server; at the time of writing I'm about a third of the way through the build, and I'm hoping to have it live and packaged by morning EST.
Happy to answer questions about the correction dynamics, the MCP architecture, or anything in the paper.