frontpage.

Show HN: Context Gateway – Compress agent context before it hits the LLM

https://github.com/Compresr-ai/Context-Gateway
22•ivzak•1h ago
We built an open-source proxy that sits between coding agents (Claude Code, OpenClaw, etc.) and the LLM, compressing tool outputs before they enter the context window.

Demo: https://www.youtube.com/watch?v=-vFZ6MPrwjw#t=9s

Motivation: Agents are terrible at managing context. A single file read or grep can dump thousands of tokens into the window, most of it noise. This isn't just expensive — it actively degrades quality. Long-context benchmarks consistently show steep accuracy drops as context grows (OpenAI's GPT-5.4 eval goes from 97.2% at 32k to 36.6% at 1M https://openai.com/index/introducing-gpt-5-4/).

Our solution uses small language models (SLMs): we look at model internals and train classifiers to detect which parts of the context carry the most signal. When a tool returns output, we compress it conditioned on the intent of the tool call—so if the agent called grep looking for error handling patterns, the SLM keeps the relevant matches and strips the rest.
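
The mechanism can be sketched in a few lines. This is a hypothetical illustration, not code from the repo: a real deployment would score spans with the trained SLM classifier, and here a simple keyword heuristic stands in for it.

```python
# Hypothetical sketch of intent-conditioned compression (NOT the actual
# Context-Gateway code): keep only the lines of a tool's output that match
# the intent of the tool call, plus a little surrounding context.

def compress(tool_output: str, intent_keywords: list[str], keep_context: int = 1) -> str:
    """Keep lines relevant to the call's intent; summarize what was dropped."""
    lines = tool_output.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        # A trained SLM classifier would replace this keyword heuristic.
        if any(k.lower() in line.lower() for k in intent_keywords):
            for j in range(max(0, i - keep_context), min(len(lines), i + keep_context + 1)):
                keep.add(j)
    kept = [lines[i] for i in sorted(keep)]
    dropped = len(lines) - len(kept)
    if dropped:
        kept.append(f"[... {dropped} low-signal lines stripped; call expand() to restore ...]")
    return "\n".join(kept)

# The agent grepped for error handling, so only matching lines survive.
raw = "import os\nx = 1\ntry:\n    risky()\nexcept IOError as e:\n    log(e)\nprint('done')"
compact = compress(raw, ["try", "except", "error"])
print(compact)
```

The same conditioning applies to any tool: a file read issued while hunting for a function definition would keep that definition and its call sites rather than the whole file.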

If the model later needs something we removed, it calls expand() to fetch the original output. We also do background compaction at 85% window capacity and lazy-load tool descriptions so the model only sees tools relevant to the current step.
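
A minimal sketch of that bookkeeping, using invented names (`ContextStore`, `stash`) rather than anything from the repo:

```python
# Hypothetical sketch (not the repo's implementation): originals are stashed
# under the tool-call id so expand() can recover stripped content, and
# compaction triggers once the window passes the 85% threshold.

class ContextStore:
    def __init__(self, window_tokens: int, compact_at: float = 0.85):
        self.window_tokens = window_tokens
        self.compact_at = compact_at
        self.originals: dict[str, str] = {}  # call id -> full tool output
        self.used_tokens = 0

    def stash(self, call_id: str, original: str, compressed: str) -> str:
        """Record the full output but charge the window only for the compressed form."""
        self.originals[call_id] = original
        self.used_tokens += len(compressed.split())  # crude token estimate
        return compressed

    def expand(self, call_id: str) -> str:
        """Recover the original output the compressor stripped."""
        return self.originals[call_id]

    def needs_compaction(self) -> bool:
        return self.used_tokens >= self.compact_at * self.window_tokens

store = ContextStore(window_tokens=100)
store.stash("grep-1", "full grep output " * 50, "matches only")
print(store.expand("grep-1")[:16])   # the original survives for recovery
print(store.needs_compaction())      # well under the 85% threshold
```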

The proxy also gives you spending caps, a dashboard for tracking running and past sessions, and Slack pings when an agent is sitting there waiting on you.

Repo is here: https://github.com/Compresr-ai/Context-Gateway. You can try it with:

  curl -fsSL https://compresr.ai/api/install | sh
Happy to go deep on any of it: the compression model, how the lazy tool loading works, or anything else about the gateway. Try it out and let us know how you like it!

Comments

verdverm•1h ago
I don't want some other tooling messing with my context. It's too important to leave to something that needs to optimize across many users, thereby not being the best for my specifics.

The framework I use (ADK) already handles this, very low hanging fruit that should be a part of any framework, not something external. In ADK, this is a boolean you can turn on per tool or subagent, you can even decide turn by turn or based on any context you see fit by supplying a function.
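
For readers who haven't used such a framework, the pattern being described, compression as a per-tool flag or a user-supplied decision function, looks roughly like this (illustrative only, not ADK's actual API):

```python
# Framework-agnostic sketch of per-tool compression control (illustrative,
# NOT ADK's actual API): `compress` is either a boolean or a predicate over
# the current turn's context.

from typing import Callable, Union

def make_tool(name: str, fn: Callable[[str], str],
              compress: Union[bool, Callable[[dict], bool]] = False):
    """Wrap a tool; decide per call whether to compress its output."""
    def run(arg: str, turn_context: dict) -> str:
        out = fn(arg)
        should = compress(turn_context) if callable(compress) else compress
        if should and len(out) > 80:
            out = out[:80] + " [truncated]"  # stand-in for real summarization
        return out
    return name, run

# Compress grep output only once the session is past 50 turns.
_, grep = make_tool("grep", lambda q: ("match: " + q + "\n") * 40,
                    compress=lambda ctx: ctx["turn"] > 50)
early = grep("err", {"turn": 3})   # full output
late = grep("err", {"turn": 99})   # truncated
print(len(early), len(late))
```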

YC over-indexed on AI startups too early, not realizing how trivial these startup "products" are; they're more of a line item in the feature list of a mature agent framework.

I've also seen dozens of this same project submitted by the claws, which led to our new rule addition this week. If your project can be vibe coded by dozens of people in mere hours...

thesiti92•1h ago
Do you guys have any stats on how much faster this is than Claude's or Codex's compression? Claude's is super super slow, but Codex feels like an acceptable amount of time? Looks cool tho, I'll have to try it out and see if it messes with outputs or not.
jameschaearley•58m ago

The intent-conditioned compression is the interesting part here. Most context management I've seen is either naive truncation or generic summarization that doesn't account for why the tool was called. Training classifiers on model internals to figure out which tokens carry signal for a given task -- that's doing something different from what frameworks offer out of the box.

I poked around the repo and didn't see any evals measuring compression quality. You cite the GPT-5.4 long-context accuracy drop as motivation, which makes sense -- but the natural follow-up is: does your compression actually recover that accuracy? Something like SWE-bench pass rates with and without the gateway at various context lengths would go a long way. Without that, it's hard to tell if the SLM is making good decisions or just making the context shorter.

A few other things I'm curious about:

• How does the SLM handle ambiguous tool calls? E.g., a broad grep where the agent isn't sure what it's looking for yet -- does the compressor tend to be too aggressive in those cases?
• What's the latency overhead per tool call? If the SLM inference adds even 200-300ms per compression step, that compounds fast in agentic loops with dozens of tool calls.
• How often does expand() get triggered in practice? If the agent frequently needs to recover stripped content, that's a signal the compression is too lossy.
metadat•47m ago
Don't post generated/AI-edited comments. HN is for conversation between humans: https://news.ycombinator.com/item?id=47340079 (1 day ago, 1700 comments)
PufPufPuf•32m ago
That comment reads pretty normal to me, and it raises valid points
altruios•31m ago
Regardless, these appear to be valid, sound questions whose answers I'm interested in.
uaghazade•38m ago
ok, it's great
esafak•36m ago
I can already prevent context pollution with subagents. How is this better?
eegG0D•26m ago
This is a massive win for anyone serious about "Signal over Noise." I’ve been using Claude Code on the Max plan for months, and while it’s the best tool for actually getting work done, the "all-you-can-eat" token arbitrage is a trap. Agents are notoriously sloppy with context; a single misaligned grep can dump thousands of tokens of pure noise into your window, leading to what I call "contextual brain rot" where the model’s accuracy just falls off a cliff. By sitting in the middle and ruthlessly prioritizing signal, you’re providing the exact kind of "ruthless prioritization" that separates a hobbyist from a profitable AI solopreneur.

The fact that you’re using Small Language Models (SLMs) to detect signal matches my philosophy of using AI as a sparring partner to check its own work. Most developers spend 30% of their day context switching or debugging "hallucinations" that only happen because the model got lost in its own bloated history. The expand() feature is the "trust but verify" layer that every production-ready AI system needs. You’re effectively treating the LLM like a senior architect who doesn't need to see every line of a dependency file unless they specifically ask for it, which is the only way we scale these systems to 10M+ users solo.

Finally, those spending caps and Slack pings are the ultimate "millionaire cheat codes" for leverage. I tell founders all the time that running a business is boring drudgery—it's about fixing bugs and managing resources—and this proxy handles the resource management part on autopilot. If this saves an indie hacker $500/month in token waste while keeping their agent from rage-quitting due to context limits, you’ve built a high-leverage asset. I’m definitely adding this to my links database; it removes a huge excuse for why people "can't afford" to build complex apps.

mmastrac•22m ago
Please don't dump AI-generated comments into HN. The signal is already pretty hard to find around all the noise.
post-it•12m ago
> This is a massive win for anyone serious about "Signal over Noise."

Not you, clearly.

root_axis•10m ago
Funny enough, Anthropic just went GA with 1M-context Claude, which has supposedly solved the lost-in-the-middle problem.
kuboble•5m ago
I wonder what the business model is.

It seems like a tool that solves a problem that won't last longer than a couple of months, and it's something that e.g. Claude Code can, and probably will, tackle itself soon.

Continuum – Unit tests for LLM workflows

https://github.com/Mofa1245/Continuum
1•Mofa1245•1m ago•2 comments

Neural Thickets

https://thickets.mit.edu/
1•jasondavies•5m ago•0 comments

Papers: A minimal AI/ML research reader

https://github.com/daneb/papers
2•danebalia•5m ago•0 comments

Show HN: Monet – Claude Code Multisession Management

https://www.monetworkspace.com/
1•jeffrwells•7m ago•0 comments

Lies I was Told About Collaborative Editing, Part 2: Why we don't use Yjs

https://www.moment.dev/blog/lies-i-was-told-pt-2
2•litacho1•7m ago•0 comments

Microsoft Copilot Health Centralizes Personal Medical Records

https://reclaimthenet.org/microsoft-copilot-health-centralizes-personal-medical-records
2•uyzstvqs•7m ago•0 comments

What Happened at FOSDEM 2026

https://www.i-programmer.info/news/99-professional/18729-what-happened-at-fosdem-2026.html
1•aquastorm•7m ago•0 comments

'We don't need Ukraine's help' – Trump rebuffs Zelensky's drone defense offer

https://kyivindependent.com/trump-ukraine-drone-defenses/
1•inaros•8m ago•0 comments

APIfy: Generate production-ready REST APIs from plain language

https://github.com/jetywolf/APIfy
1•jetywolf•9m ago•1 comments

Amazon Will Use Cerebras' Giant Chips to Help Run AI Models

https://www.bloomberg.com/news/articles/2026-03-13/amazon-will-use-cerebras-giant-chips-to-help-r...
2•inaros•9m ago•0 comments

Ask HN: What's your biggest pain point when joining a new developer team?

2•KevStatic•14m ago•1 comments

Show HN: Tiny macOS app that adds a facecam bubble to screen recordings

https://github.com/backnotprop/CamBubble
4•ramoz•14m ago•0 comments

Atoms, Travis Kalanick's new company

https://atoms.co/vision
4•sethbannon•15m ago•1 comments

Atoms

https://atoms.co/
3•sethbannon•15m ago•0 comments

How to Bring Starter Homes Back from Extinction

https://www.bloomberg.com/opinion/articles/2026-03-13/how-to-bring-starter-homes-back-from-extinc...
2•toomuchtodo•16m ago•0 comments

Stanford researchers report first recording of a blue whale's heart rate (2019)

https://news.stanford.edu/stories/2019/11/first-ever-recording-blue-whales-heart-rate
3•eatonphil•16m ago•1 comments

It Took Me 30 Years to Solve This VFX Problem – Green Screen Problem [video]

https://www.youtube.com/watch?v=3Ploi723hg4
2•yincrash•17m ago•0 comments

Lessons for software developers from 1970s mainframe programming

https://web.archive.org/web/20171025203854/https://insights.hpe.com/articles/4-lessons-for-modern...
2•ohjeez•18m ago•0 comments

Elon Musk says xAI must be 'rebuilt' as co-founder exodus continues

https://www.cnbc.com/2026/03/13/elon-musk-xai-co-founders-spacex-ipo.html
5•inaros•18m ago•0 comments

Practical dependency tracking for Python function calls

https://amakelov.github.io/mandala/blog/02_deps/
2•KerrickStaley•20m ago•0 comments

A calmer interface for a product in motion

https://linear.app/now/behind-the-latest-design-refresh
2•tjwds•21m ago•0 comments

Sigma's New Rice Company Is Less About Rice and More About Aizu

https://petapixel.com/2026/03/12/sigmas-new-rice-company-is-less-about-rice-and-more-about-aizu/
1•lastofthemojito•24m ago•0 comments

ICE agents reveal daily arrest quotas and surveillance app in court testimony

https://www.theguardian.com/us-news/2026/mar/13/ice-agent-court-testimony-oregon
5•mitchbob•25m ago•0 comments

Everything's Casino

https://www.joanwestenberg.com/everythings-casino/
2•alcazar•27m ago•0 comments

Yet another Valve lawsuit on loot boxes

https://www.windowscentral.com/gaming/pc-gaming/steams-valve-responds-to-lawsuit-from-new-york-at...
2•s3r3nity•27m ago•0 comments

Account regional namespaces for Amazon S3 general purpose buckets

https://aws.amazon.com/blogs/aws/introducing-account-regional-namespaces-for-amazon-s3-general-pu...
2•timoth•31m ago•1 comments

How we built a prompt optimization agent

https://www.extend.ai/resources/how-we-built-composer
3•kbyatnal•35m ago•0 comments

The Great AI Silicon Shortage

https://newsletter.semianalysis.com/p/the-great-ai-silicon-shortage
6•akyuu•36m ago•0 comments

H-1B Visa employers database goes offline, key public records disappear

https://timesofindia.indiatimes.com/technology/tech-news/h-1b-visa-employers-database-goes-offlin...
4•alexfromapex•36m ago•1 comments

AMUX – Tmux and Tailscale powered offline-first agent multiplexer

https://amux.io/
2•Beefin•39m ago•0 comments