Author here. I built Headroom because I was spending $200/day running agents with tool calls.
The problem: tools return huge JSON (search results, DB queries, file listings). Each response bloats context. By turn 10, you're paying for 100k+ tokens on every LLM call.
Existing solutions have a fundamental tradeoff:
- Truncation: fast but you might cut data the model needs
- Summarization: slow (~500ms) and still lossy
- Bigger context: just delays the problem and costs more
The insight behind Headroom:
You can't know which data matters until the model tries to use it. So instead of guessing, compress aggressively AND keep a retrieval path.
1. Smart compression - not random truncation. For JSON arrays, we keep all errors, statistical anomalies, items matching the user's query (BM25 + embeddings), and the first/last items (see the sketch just after this list). For code, we use tree-sitter AST parsing to preserve imports, signatures, and types - the output is guaranteed to be syntactically valid. For logs, we keep errors and state transitions.
2. CCR (Compress-Cache-Retrieve) - everything compressed gets cached locally. We inject a `headroom_retrieve` tool. If the model needs more data, it asks and gets it in <1ms.
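In rough Python, the array heuristic boils down to something like this (a simplified sketch rather than the real internals; score_relevance stands in for the BM25 + embedding scorer, and the anomaly pass is omitted):

    def score_relevance(item, query):
        # Stand-in for the BM25 + embedding scorer: plain token overlap.
        text = str(item).lower()
        return sum(1 for tok in query.lower().split() if tok in text)

    def compress_array(items, query, keep=20):
        # Keep a small, high-signal subset of a large JSON array.
        kept = {}

        # Errors always survive compression.
        for i, item in enumerate(items):
            if isinstance(item, dict) and item.get("error"):
                kept[i] = item

        # First and last items anchor the array's overall shape.
        if items:
            kept[0] = items[0]
            kept[len(items) - 1] = items[-1]

        # Spend the remaining budget on the most query-relevant items.
        ranked = sorted(range(len(items)),
                        key=lambda i: score_relevance(items[i], query),
                        reverse=True)
        for i in ranked:
            if len(kept) >= keep:
                break
            kept.setdefault(i, items[i])

        # Preserve original order and record how much was dropped.
        ordered = [kept[i] for i in sorted(kept)]
        return {"items": ordered, "omitted": len(items) - len(ordered)}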
The retrieval is what makes aggressive compression safe. In practice, the model almost never retrieves because the smart compression keeps what matters. But when it does need more, it can get it.
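The CCR side is conceptually just a keyed LRU cache plus one extra tool definition that gets injected. A sketch (the tool schema shape here is illustrative, not the exact one Headroom emits):

    from collections import OrderedDict

    class CompressionCache:
        # Tiny LRU cache: full tool outputs, keyed by the id left behind in the compressed text.
        def __init__(self, max_entries=512):
            self.max_entries = max_entries
            self._store = OrderedDict()

        def put(self, key, full_output):
            self._store[key] = full_output
            self._store.move_to_end(key)
            if len(self._store) > self.max_entries:
                self._store.popitem(last=False)  # evict least recently used

        def get(self, key):
            value = self._store.get(key)
            if value is not None:
                self._store.move_to_end(key)
            return value

    # The injected tool: when the model calls it, the answer comes straight
    # from the local cache, with no extra LLM call, which is why it's <1ms.
    RETRIEVE_TOOL = {
        "name": "headroom_retrieve",
        "description": "Fetch the full, uncompressed version of a previously compressed tool result.",
        "input_schema": {
            "type": "object",
            "properties": {"cache_key": {"type": "string"}},
            "required": ["cache_key"],
        },
    }

    def handle_retrieve(cache, cache_key):
        return cache.get(cache_key) or "No cached data for that key."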
Results on my workloads:
- Search results (1000 items): 45k → 4.5k tokens (90%)
- Agent with tools (10 calls): 100k → 15k tokens (85%)
- Overhead: 1-5ms per request
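To translate the token numbers into dollars, a back-of-envelope example (the $3 per million input tokens is a made-up placeholder price, not any provider's actual rate):

    PRICE_PER_M_INPUT = 3.00  # hypothetical $/1M input tokens

    def call_cost(input_tokens):
        return input_tokens * PRICE_PER_M_INPUT / 1_000_000

    print(f"per call: ${call_cost(100_000):.4f} -> ${call_cost(15_000):.4f}")
    # per call: $0.3000 -> $0.0450, and that gap repeats on every turn of the agent loop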
Usage:
As a proxy (zero code changes):
pip install "headroom-ai[proxy]"
headroom proxy --port 8787
ANTHROPIC_BASE_URL=http://localhost:8787 claude
Or wrap your client:
from headroom import HeadroomClient
client = HeadroomClient(OpenAI())
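Calling it then looks like a normal OpenAI call (a sketch, assuming the wrapper exposes the same chat.completions surface as the client it wraps):

    from openai import OpenAI
    from headroom import HeadroomClient

    client = HeadroomClient(OpenAI())

    # Sketch: assuming the wrapper is a drop-in for the wrapped client,
    # requests go through unchanged and compression happens transparently.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize the open incidents."}],
    )
    print(response.choices[0].message.content)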
LangChain integration is one line.
Limitations I'm aware of:
- CCR adds memory overhead (LRU cache, configurable)
- AST compression requires tree-sitter (~50MB)
- Not battle-tested on all edge cases yet
Happy to answer questions about the compression algorithms, the retrieval mechanism, or anything else.