Curious how it holds up as note volume grows. At some point you're still doing N relevance checks per tool call. Do you hit a scaling limit there, or does caching keep it manageable? Also wondering if you've seen any drift in relevance when notes become more numerous or overlapping.
NBenkovich•1h ago
CLAUDE.md is manual — you maintain it. MEMORY.md is automatic but it only gets read once at session startup, it's scoped per repo so your cross-project preferences don't carry over, and once it gets long enough only the top N lines are read. Older notes just silently disappear from context regardless of how important they are. I had a note about a critical config value fall off the bottom and cost me an hour.
So the core idea: instead of a flat file, use a small LLM as memory. Store notes in the LLM's context window (system prompt), prompt it with what the agent is currently doing, and get back only the hints that are actually relevant right now. Not at startup — right before each tool call.
You write notes like this:
```
cmr-memory write "auth token expiry is 900s not 3600s" --when "working on auth, tokens, config.ts"
```
The `--when` field is a plain-language activation condition. The memory LLM reads the conversation transcript and decides if the condition matches the current context. It's not keyword matching — a note tagged "working on authentication" will surface when the agent is editing JWT middleware even if neither word appears in the recent transcript.
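A minimal sketch of how that relevance check could be framed, assuming the memory LLM is simply asked which conditions match (the prompt wording and names here are illustrative, not the actual implementation):

```typescript
// A stored note: the hint itself plus its plain-language activation condition.
interface Note {
  text: string; // e.g. "auth token expiry is 900s not 3600s"
  when: string; // e.g. "working on auth, tokens, config.ts"
}

// Build the prompt the memory LLM answers. It sees every note's condition
// and the tail of the agent's transcript, and replies with matching note
// numbers (or NONE). Semantic matching is delegated entirely to the model.
function buildRelevancePrompt(notes: Note[], transcriptTail: string): string {
  const listing = notes
    .map((n, i) => `${i + 1}. condition: "${n.when}" -> hint: "${n.text}"`)
    .join("\n");
  return [
    "You retrieve memory for a coding agent. Below are stored notes, each",
    "with an activation condition. Given the agent's recent activity, reply",
    "with the numbers of the notes whose conditions match, or NONE.",
    "",
    "Notes:",
    listing,
    "",
    "Recent activity:",
    transcriptTail,
  ].join("\n");
}
```

Because the model, not a keyword filter, judges the match, a condition like "working on auth" can fire while the agent edits JWT middleware even though the transcript never says "auth".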
The tricky part is making this fast enough to run on every tool call without being expensive. One LLM call over all your notes doesn't scale — attention degrades in long contexts, costs add up fast, and you'd be waiting 5 seconds before every file read. The solution is splitting notes into small chunks and processing them in parallel with Haiku (fast, cheap, good enough for this job — you don't need Opus to decide if a note is contextually relevant). Each chunk lives in a stable system prompt that gets cached by Anthropic's prompt caching, so after the first call against a chunk it's mostly served from cache.
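The chunking step above can be sketched roughly like this (a simplified illustration; `checkChunk` and the chunk size are hypothetical, not the real code):

```typescript
// Group notes into small, stable chunks. Keeping each chunk's membership and
// order stable means its system prompt stays byte-identical across tool
// calls, which is what lets Anthropic's prompt caching serve it from cache.
function chunkNotes<T>(notes: T[], chunkSize: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < notes.length; i += chunkSize) {
    chunks.push(notes.slice(i, i + chunkSize));
  }
  return chunks;
}

// The chunks would then be checked concurrently, one cheap Haiku call each,
// e.g. (checkChunk is a stand-in for the cached-system-prompt API call):
//
//   const hits = await Promise.all(
//     chunkNotes(allNotes, 20).map((chunk) => checkChunk(chunk, transcriptTail))
//   );
```

Parallel small calls keep latency close to a single call's round trip, and the per-chunk cache means only the short transcript tail is fresh input on each check.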
There's also a deduplication step: once a hint has been injected into the conversation, re-injecting it on the next tool call just bloats context for no reason. The system extracts what's already present in the transcript and skips anything redundant.
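In its simplest form the dedup step is just a containment check against the transcript (a sketch; the real extraction step is presumably fuzzier than exact substring matching):

```typescript
// Drop any hint whose text already appears in the conversation transcript,
// so retrieval never injects the same note twice.
function dedupeHints(hints: string[], transcript: string): string[] {
  return hints.filter((hint) => !transcript.includes(hint));
}
```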
The whole thing wires in via Claude Code's PreToolUse/PostToolUse hooks. PreToolUse fires before every tool call, reads the transcript, runs retrieval, injects hints. PostToolUse sends a short nudge asking if anything worth saving just happened — the agent decides whether to write a note, nothing is forced.
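The wiring in `.claude/settings.json` might look something like this (the `cmr-memory hook …` subcommands are my guess at the CLI surface, not confirmed from the package):

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "",
        "hooks": [{ "type": "command", "command": "cmr-memory hook pre-tool-use" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "",
        "hooks": [{ "type": "command", "command": "cmr-memory hook post-tool-use" }]
      }
    ]
  }
}
```

An empty `matcher` applies the hook to every tool, which is what you want if hints should be able to surface before any file read or edit.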
To set it up:
```
npm install -g @agynio/cmr-memory cmr-m
```