Show HN: Butter, a muscle memory cache for LLMs

https://docs.butter.dev

23•edunteman•4mo ago

Hi HN, Erik here. Today we launch Butter, an OpenAI-compatible API proxy that caches LLM generations and serves them deterministically on revisit.

Since April, we’ve been working on this concept of “muscle memory,” or deterministic replay, for agent systems performing automations. You may recall our first post in May, launching a python package called Muscle Mem: https://news.ycombinator.com/item?id=43988381

Since then, the product has evolved entirely, now taking the form of an LLM Proxy. For a deep dive into this process, check out: https://blog.butter.dev/muscle-mem-as-a-proxy

The proxy’s killer feature is being template-aware, meaning it can reuse cache entries across structurally similar requests. Inducing variable structure from context windows is no easy task, which we cover in a technical writeup here: https://blog.butter.dev/template-aware-caching

The proxy is currently open-access and free to use so we can quickly discover and work through a slew of edge cases and template-induction errors. There’s much work to be done before it’s technically sound, but we’d love to see you take Butter for a spin and share how it went, where it breaks, if it’s helpful, if we're going down a dead end, etc.

Cheers!

Comments

ketan_around•4mo ago

Exciting to see a product like this launch! There are obviously a host of ‘memory’ solutions out there that try to integrate in fancy ways to cache knowledge / save tokens, but I think there’s a beauty in simplicity to just having a proxy over the OpenAI endpoint.

Interested to see where this goes!

edunteman•4mo ago

An interesting alternative product to offer is injecting prompt cache tokens into requests where they could be helpful; not bypassing generations but at least low hanging fruit for cost savings

tsvoboda•4mo ago

looks pretty cool! How would you integrate this into production agent stacks like langchain, autogpt, even closed loop robotics?

edunteman•4mo ago

Thanks! For langchain you can repoint your base_url in the client. Autogpt I'm not as familiar with. Closed loop robotics using LLMs may be a stretch for now, especially since vision is a heavy component, but theoretically the patterns baked into small language models running on-device or hosted LLMs at higher level planning loops, could be emulated by a butter cache if observed in high enough volume.

raymondtana•4mo ago

For AutoGPT, there is the option to set a llamafile endpoint, which follows the Chat Completions API. So, theoretically, you should be able to use that to point to Butter's LLM proxy.

samraaj•4mo ago

logged back in to HN to comment on this. looks really sick - i've been saying for a while that a surprising amount of LLM inference really comes down to repetition down a known path.

it's good to see others have seen this problem and are working to make things more efficient. I'm excited to see where this goes.

MorganGallant•4mo ago

I've known Erik for a while now — simply incredible founder. Doing this as a simple API proxy makes this practically effortless to integrate into existing systems, just a simple URL swap and you're good to go. Then, it's just a matter of watching the cache hit rate go up!

zyadelgohary1•4mo ago

This is awesome, Erik! Excited to see this launch. Definitely fixes some issues we had while building pure CopyCat

bigwheels•4mo ago

Are you able to walk through a specific use case or example case in detail? I'm not yet totally grokking what Butter is going to do exactly.

edunteman•4mo ago

I've got a blog on this from the launch of Muscle Mem, which should paint a better picture https://erikdunteman.com/blog/muscle-mem

Computer use agents (as an RPA alternative) is the easiest example to reach to: UIs change but not often, so the "trajectory" of click and key entry tool calls is mostly fixed over time and worth feeding to the agent as a canned trajectory. I discuss the flaws of computer use and RPA in the blog above.

A counterexample is coding agents: it's a deeply user-interractive workflow reading from a codebase that's evolving. So the set of things the model is inferencing on is always different, and trajectories are never repeated.

Hope this helps

bigwheels•4mo ago

Still not clear - the tool calls come from the model, so what is being cached by Muscle Memory?

Also:

  After my time building computer-use agents, I’m convinced that the hybrid approach of Muscle Memory is the only viable way to offer 100% coverage on an RPA workload.

100% coverage of what?

I guess it'd be great if you could clarify the value proposition, many folks will be even less patient than myself.

Best of luck!

Beyond Agentic Coding

OpenClaw ClawHub Broken Windows Theory – If basic sorting isn't working what is?

OpenBSD Copyright Policy

OpenClaw Creator: Why 80% of Apps Will Disappear

What Happens When Technical Debt Vanishes?

AI Is Finally Eating Software's Total Market: Here's What's Next

Computer Science from the Bottom Up

Show HN: I built a toy compiler as a young dev

You don't need Mac mini to run OpenClaw

Learning to Reason in 13 Parameters

Convergent Discovery of Critical Phenomena Mathematics Across Disciplines

Ask HN: Will GPU and RAM prices ever go down?

From hunger to luxury: The story behind the most expensive rice (2025)

Substack makes money from hosting Nazi newsletters

A New Crypto Winter Is Here and Even the Biggest Bulls Aren't Certain Why

Moltbook was peak AI theater

Why Claude Cowork is a math problem Indian IT can't solve

Show HN: Built an space travel calculator with vanilla JavaScript v2

Why a 175-Year-Old Glassmaker Is Suddenly an AI Superstar

Micro-Front Ends in 2026: Architecture Win or Enterprise Tax?

These White-Collar Workers Actually Made the Switch to a Trade

The Wonder Drug That's Plaguing Sports

Show HN: Which chef knife steels are good? Data from 540 Reddit tread

Federated Credential Management (FedCM)

Token-to-Credit Conversion: Avoiding Floating-Point Errors in AI Billing Systems

The Story of Heroku (2022)

Obey the Testing Goat

Claude Opus 4.6 extends LLM pareto frontier

Brute Force Colors (2022)

Google Translate apparently vulnerable to prompt injection

Beyond Agentic Coding

OpenClaw ClawHub Broken Windows Theory – If basic sorting isn't working what is?

OpenBSD Copyright Policy

OpenClaw Creator: Why 80% of Apps Will Disappear

What Happens When Technical Debt Vanishes?

AI Is Finally Eating Software's Total Market: Here's What's Next

Computer Science from the Bottom Up

Show HN: I built a toy compiler as a young dev

You don't need Mac mini to run OpenClaw

Learning to Reason in 13 Parameters

Convergent Discovery of Critical Phenomena Mathematics Across Disciplines

Ask HN: Will GPU and RAM prices ever go down?

From hunger to luxury: The story behind the most expensive rice (2025)

Substack makes money from hosting Nazi newsletters

A New Crypto Winter Is Here and Even the Biggest Bulls Aren't Certain Why

Moltbook was peak AI theater

Why Claude Cowork is a math problem Indian IT can't solve

Show HN: Built an space travel calculator with vanilla JavaScript v2

Why a 175-Year-Old Glassmaker Is Suddenly an AI Superstar

Micro-Front Ends in 2026: Architecture Win or Enterprise Tax?

These White-Collar Workers Actually Made the Switch to a Trade

The Wonder Drug That's Plaguing Sports

Show HN: Which chef knife steels are good? Data from 540 Reddit tread

Federated Credential Management (FedCM)

Token-to-Credit Conversion: Avoiding Floating-Point Errors in AI Billing Systems

The Story of Heroku (2022)

Obey the Testing Goat

Claude Opus 4.6 extends LLM pareto frontier

Brute Force Colors (2022)

Google Translate apparently vulnerable to prompt injection

Show HN: Butter, a muscle memory cache for LLMs

Comments