I built a local proxy that dedupes Claude Code traffic. TokenShield — cuts your Claude Code bill 40-70%
Comments
curatedmcp•33m ago
I run Claude Code most days and my Anthropic bill kept creeping up
without me understanding which conversations were the expensive ones.
A 25-turn agentic session re-reads `auth.ts` five times and re-runs
`gh pr list` three times — every duplicate ships as a fresh
tool_result content block to the model, every time. The model already
saw identical bytes two turns ago, but it doesn't matter; you pay
for them again.
TokenShield is a small Node 22 proxy that sits in front of
api.anthropic.com. Set `ANTHROPIC_BASE_URL=http://127.0.0.1:7777` and
Claude Code (or Cursor, Windsurf, Zed, Aider — anything that uses the
Anthropic SDK) routes through it. The proxy hashes every tool_result
content block per-conversation; subsequent occurrences of the same
hash within the same conversation are replaced with a deterministic
stub:
{ "type": "tool_result",
"tool_use_id": "...",
"content": "[tokenshield: identical content seen at message 2, sha:1f6063fe]" }
The deterministic part matters — same input always produces the same
output, so Anthropic's prompt_cache (cache_control) stays valid.
The accounting side was the bug-hunt I didn't expect. First version
captured 0 tokens on successful responses. Turned out Anthropic
returns Content-Encoding: gzip and I was running JSON.parse on the
compressed buffer. The silent try/catch swallowed the error and I
shipped "200 status, $0.00 spent" for a week before noticing. Fix:
`accept-encoding: identity` on upstream — bandwidth on localhost
doesn't matter and reliable measurement does.
Real numbers from my own usage yesterday: one Opus-4 turn paid
$0.1719, saved $0.0747 — 30.3% on that turn. Bench numbers across
three recorded fixtures (light Q&A, medium coding, heavy 25-turn
agentic): 0%, 27.7%, 62.1%. The light case correctly does nothing
— there's nothing to dedup in a 5-turn conversation with no tool use.
What's deliberately not here:
- It only helps API-billed clients. Claude Desktop and claude.ai web
are flat-subscription / hardcoded endpoint — no metered bill to
compress.
- OpenAI / Gemini are on a waitlist. The provider adapter interface
is in the code, but I'd rather ship one provider correctly than
three half-baked.
- Compression is conservative (>= 256 byte payloads, content-hash
keyed). It does not touch streaming SSE bytes — passthrough is
byte-faithful so the prompt cache stays valid.
The piece I'd most appreciate feedback on: I'm staring at
`response-cache` as the next processor (LRU+TTL on (model,
system_prompt_hash, last_user_msg_hash)), but the correctness story
scares me — `temperature > 0` and tool-using conversations both
make caching risky. Would love thoughts from anyone who has done
semantic caching on LLM responses about where the landmines are.
`npm i -g @curatedmcp/tokenshield` to try it. MIT, self-hosted, your
API key never leaves the machine, telemetry is opt-in and
aggregate-only.
curatedmcp•33m ago
TokenShield is a small Node 22 proxy that sits in front of api.anthropic.com. Set `ANTHROPIC_BASE_URL=http://127.0.0.1:7777` and Claude Code (or Cursor, Windsurf, Zed, Aider — anything that uses the Anthropic SDK) routes through it. The proxy hashes every tool_result content block per-conversation; subsequent occurrences of the same hash within the same conversation are replaced with a deterministic stub:
The deterministic part matters — same input always produces the same output, so Anthropic's prompt_cache (cache_control) stays valid.The accounting side was the bug-hunt I didn't expect. First version captured 0 tokens on successful responses. Turned out Anthropic returns Content-Encoding: gzip and I was running JSON.parse on the compressed buffer. The silent try/catch swallowed the error and I shipped "200 status, $0.00 spent" for a week before noticing. Fix: `accept-encoding: identity` on upstream — bandwidth on localhost doesn't matter and reliable measurement does.
Real numbers from my own usage yesterday: one Opus-4 turn paid $0.1719, saved $0.0747 — 30.3% on that turn. Bench numbers across three recorded fixtures (light Q&A, medium coding, heavy 25-turn agentic): 0%, 27.7%, 62.1%. The light case correctly does nothing — there's nothing to dedup in a 5-turn conversation with no tool use.
What's deliberately not here:
- It only helps API-billed clients. Claude Desktop and claude.ai web are flat-subscription / hardcoded endpoint — no metered bill to compress. - OpenAI / Gemini are on a waitlist. The provider adapter interface is in the code, but I'd rather ship one provider correctly than three half-baked. - Compression is conservative (>= 256 byte payloads, content-hash keyed). It does not touch streaming SSE bytes — passthrough is byte-faithful so the prompt cache stays valid.
The piece I'd most appreciate feedback on: I'm staring at `response-cache` as the next processor (LRU+TTL on (model, system_prompt_hash, last_user_msg_hash)), but the correctness story scares me — `temperature > 0` and tool-using conversations both make caching risky. Would love thoughts from anyone who has done semantic caching on LLM responses about where the landmines are.
`npm i -g @curatedmcp/tokenshield` to try it. MIT, self-hosted, your API key never leaves the machine, telemetry is opt-in and aggregate-only.
More at: https://curatedmcp.com/tokenshield