Universal Claude.md – cut Claude output tokens by 63%

https://github.com/drona23/claude-token-efficient

82•killme2008•1h ago

Comments

yieldcrv•1h ago

> Note: most Claude costs come from input tokens, not output. This file targets output behavior

so everyone, that means your agents, skills and mcp servers will still take up everything

rcleveng•1h ago

While I love this set of prompts, I’ve not seen my Claude Opus 4.6 give such verbose responses when using Claude Code. Is this intended for use outside of Claude Code?

btown•1h ago

It seems the benchmarks here are heavily biased towards single-shot explanatory tasks, not agentic loops where code is generated: https://github.com/drona23/claude-token-efficient/blob/main/...

And I think this raises a really important question. When you're deep into a project that's iterating on a live codebase, does Claude's default verbosity, where it's allowed to expound on why it's doing what it's doing when it's writing massive files, allow the session to remain more coherent and focused as context size grows? And in doing so, does it save overall tokens by making better, more grounded decisions?

The original link here has one rule that says: "No redundant context. Do not repeat information already established in the session." Personally, I want more of that repetition, not less. Those are goal-oriented quasi-reasoning tokens that I do want it to emit, visualize, and use, and that very possibly keep it from getting "lost in the sauce."

By all means, use this in environments where output tokens are expensive, and you're processing lots of data in parallel. But I'm not sure there's good data on this approach being effective for agentic coding.

sillysaurusx•1h ago

I wrote a skill called /handoff. Whenever a session is nearing a compaction limit or has outlived its usefulness, it generates and commits a markdown file explaining everything it did or talked about. It’s called /handoff because you do it before a compaction. (“Isn’t that what compaction is for?” Yes, but those go away. This is like a permanent record of compacted sessions.)

I don’t know if it helps maintain long term coherency, but my sessions do occasionally reference those docs. More than that, it’s an excellent “daily report” type system where you can give visibility to your manager (and your future self) on what you did and why.

Point being, it might be better to distill that long term cohesion into a verbose markdown file, so that you and your future sessions can read it as needed. A lot of the context is trying stuff and figuring out the problem to solve, which can be documented much more concisely than by letting it fill up your context window.

EDIT: Someone asked for installation steps, so I posted it here: https://news.ycombinator.com/item?id=47581936

david_allison•56m ago

Is this available online? I'd love documentation of my prompts.

sillysaurusx•54m ago

I’ll post it here, one minute.

Ok, here you go: https://gist.github.com/shawwn/56d9f2e3f8f662825c977e6e5d0bf...

Installation steps:

- In your project, download https://gist.github.com/shawwn/56d9f2e3f8f662825c977e6e5d0bf... into .claude/commands/handoff.md

- In your project's CLAUDE.md file, put "Read docs/agents/handoff/*.md for context."

Usage:

- Whenever you've finished a feature, done a coherent "thing", or otherwise want to document all the stuff that's in your current session, type /handoff. It'll generate a file named e.g. docs/agents/handoff/2026-03-30-001-whatever-you-did.md. It'll ask you if you like the name, and you can say "yes" or "yes, and make sure you go into detail about X" or whatever else you want the handoff to specifically include info about.

- Optionally, type "/rename 2026-03-23-001-whatever-you-did" into claude, followed by "/exit" and then "claude" to re-open a fresh session. (You can resume the previous session with "claude 2026-03-23-001-whatever-you-did". On the other hand, I've never actually needed to resume a previous session, so you could just ignore this step entirely.)

Here's an example so you can see why I like the system. I was working on a little blockchain visualizer. At the end of the session I typed /handoff, and this was the result:

- docs/agents/handoff/2026-03-24-001-brownie-viz-graph-interactivity.md: https://gist.github.com/shawwn/29ed856d020a0131830aec6b3bc29...

The filename convention stuff was just personal preference. You can tell it to store the docs however you want to. I just like date-prefixed names because it gives a nice history of what I've done. https://github.com/user-attachments/assets/5a79b929-49ee-461...
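The date-prefixed naming described above can be sketched in a few lines. This is only an illustration of the convention, not the code the /handoff skill actually runs (there, Claude proposes the name itself). The helper `next_handoff_name` and the slug `fix-tooltips` are hypothetical; only the `docs/agents/handoff/` directory and the `YYYY-MM-DD-NNN-slug.md` pattern come from the comment.

```python
import re
from datetime import date
from pathlib import Path


def next_handoff_name(existing_names, slug, today):
    """Return the next 'YYYY-MM-DD-NNN-slug.md' name for the given day.

    Hypothetical helper: it scans existing file names for today's date
    prefix and uses the next free three-digit sequence number.
    """
    prefix = today.isoformat()  # e.g. '2026-03-24'
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d{{3}})-.*\.md$")
    used = [int(m.group(1)) for name in existing_names if (m := pattern.match(name))]
    sequence = max(used, default=0) + 1
    return f"{prefix}-{sequence:03d}-{slug}.md"


if __name__ == "__main__":
    handoff_dir = Path("docs/agents/handoff")
    # Collect existing handoff docs, if the directory is there at all.
    names = [p.name for p in handoff_dir.glob("*.md")] if handoff_dir.is_dir() else []
    print(next_handoff_name(names, "fix-tooltips", date.today()))
```

Because the names sort lexicographically by date and then by sequence number, a plain directory listing doubles as a chronological history of sessions.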

Try to do a /handoff before your conversation gets compacted, not after. The whole point is to be a permanent record of key decisions from your session. Claude's compaction theoretically preserves all of these details, so /handoff will still work after a compaction, but it might not be as detailed as it otherwise would have been.

dataviz1000•35m ago

Did you call it '/handoff' or did Claude name it that? The reason I'm asking is because I noticed a pattern with Claude subtly influencing me. For example, the first time I heard the word 'gate' was from Claude, and one week later I hear it everywhere, including on Hacker News. I didn't use the word 'handoff' but Claude creates handoff files also [0]. I was thinking about this all day. Because Claude didn't just use the word 'gate', it created an entire system around it that includes handoffs that I'm starting to see everywhere. This might mean Claude is very quietly leading and influencing us in a direction.

[0] https://github.com/search?q=repo%3Aadam-s%2Fintercept%20hand...

sillysaurusx•27m ago

I was reading through the Claude docs and it was talking about common patterns to preserve context across sessions. One pattern was a "handoff file", which they explained like "have claude save a summary of the current session into a handoff file, start a new session, then tell it to read the file."

That sounded like a nice idea, so I made it effortless beyond typing /handoff.

The generated docs turned out to be really handy for me personally, so I kept using it, and committed them into my project as they're generated.

dataviz1000•25m ago

Oh, so the word 'gate' is probably in the documentation also!

I see. So this isn't as scary. Claude is helping me understand how to use it properly.

hatmanstack•45m ago

Seems crazy to me people aren't already including rules to prevent useless language in their system/project lvl CLAUDE.md.

As far as redundancy...it's quite useful according to recent research. Pulled from Gemini 3.1 "two main paradigms: generating redundant reasoning paths (self-consistency) and aggregating outputs from redundant models (ensembling)." Both have fresh papers written about their benefits.

scosman•30m ago

also: inference time scaling. Generating more tokens when getting to an answer helps produce better answers.

Not all extra tokens help, but optimizing for minimal length when the model was RL'd on task performance seems detrimental.

sillysaurusx•1h ago

> the file loads into context on every message, so on low-output exchanges it is a net token increase

Isn’t this what Claude’s personalization setting is for? It’s globally-on.

I like conciseness, but it should be because it makes the writing better, not because it saves you some tokens. I’d sacrifice extra tokens for outputs that were 20% better, and there’s a correlation between conciseness and quality.

See also this Reddit comment for other things that supposedly help: https://www.reddit.com/r/vibecoding/s/UiOywQMOue

> Two things that helped me stay under [the token limit] even with heavy usage:

> Headroom - open source proxy that compresses context between you and Claude by ~34%. Sits at localhost, zero config once running. https://github.com/chopratejas/headroom

> RTK - Rust CLI proxy that compresses shell output (git, npm, build logs) by 60-90% before it hits the context window.

> Stacks on top of Headroom. https://github.com/rtk-ai/rtk

> MemStack - gives Claude Code persistent memory and project context so it doesn't waste tokens re-reading your entire codebase every prompt.

> That's the biggest token drain most people don't realize. https://github.com/cwinvestments/memstack

> All three stack together. Headroom compresses the API traffic, RTK compresses CLI output, MemStack prevents unnecessary file reads.

I haven’t tested those yet, but they seem related and interesting.

Tostino•1h ago

You have a benchmark for output token reduction, but no before/after comparison on some standard LLM benchmark to see whether the instructions hurt intelligence.

Telling the model to only do post-hoc reasoning is an interesting choice, and may not play well with all models.

notyourav•1h ago

It boggles my mind that an LLM "understands" these instructions and acts according to them. I'm using this every day and 1-shot working code is now a normal expectation, but man, it's still very, very hard to believe what LLMs have achieved.

andai•58m ago

I told mine to remove all unnecessary words from a sentence and talk like caveman, which should result in another 50% savings ;)

esperent•41m ago

Have you tried asking it to remove vowels?

johnwheeler•57m ago

That's what I call a feature wishlist.

cheriot•52m ago

I get where the authors are coming from with these: https://github.com/drona23/claude-token-efficient/blob/main/...

But I'd rather use the "instruction budget" on the task at hand. Some, like the Code Output section, can fit a code review skill.

monooso•51m ago

Paul Kinlan published a blog post a couple of days ago [1] with some interesting data showing that output tokens account for only 4% of token usage.

It's a pretty wide-reaching article, so here's the relevant quote (emphasis mine):

> Real-world data from OpenRouter’s programming category shows 93.4% input tokens, 2.5% reasoning tokens, and just 4.0% output tokens. It’s almost entirely input.

[1]: https://aifoc.us/the-token-salary/

wongarsu•38m ago

However output tokens are 5-10 times more expensive. So it ends up a lot more even on price

weird-eye-issue•37m ago

Yes, but with prompt caching decreasing the cost of the input by 90%, and with output tokens not being cached and costing more, what do you think that results in?
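The arithmetic in this sub-thread can be worked through with a rough sketch. The token shares are the OpenRouter figures quoted above, the 5x output price is the low end of the "5-10 times" estimate, and the 90% cache discount is the figure from the reply. Two things are assumptions made here for illustration: reasoning tokens are billed at the output rate, and the fraction of input tokens served from cache is a free parameter (the thread gives no number for it).

```python
# Token shares quoted above from OpenRouter's programming category.
SHARES = {"input": 0.934, "reasoning": 0.025, "output": 0.040}


def cost_shares(output_multiplier, cached_input_fraction=0.0, cache_discount=0.9):
    """Return each token type's share of total cost.

    Prices are relative: one uncached input token costs 1.0.
    Assumption: reasoning tokens are billed at the output rate.
    """
    # Cached input tokens cost (1 - cache_discount) of the normal input price.
    input_price = (1 - cached_input_fraction) + cached_input_fraction * (1 - cache_discount)
    costs = {
        "input": SHARES["input"] * input_price,
        "reasoning": SHARES["reasoning"] * output_multiplier,
        "output": SHARES["output"] * output_multiplier,
    }
    total = sum(costs.values())
    return {kind: cost / total for kind, cost in costs.items()}


if __name__ == "__main__":
    for label, fraction in [("no caching", 0.0), ("half cached", 0.5), ("fully cached", 1.0)]:
        shares = cost_shares(output_multiplier=5, cached_input_fraction=fraction)
        summary = ", ".join(f"{kind} {share:.0%}" for kind, share in shares.items())
        print(f"{label}: {summary}")
```

Under these assumptions, input still dominates the bill when nothing is cached, while output becomes the largest share once most input tokens are cache hits. Which regime applies depends on the cache hit rate of the actual workload.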

joshstrange•50m ago

As with all of these cure-alls, I'm wary. Mostly I'm wary because I anticipate the developer will lose interest in very little time and also because it will just get subsumed into CC at some point if it actually works. It might take longer but changing my workflow every few days for the new thing that's going to reduce MCP usage, replace it, compress it, etc is way too disruptive.

I'm generally happy with the base Claude Code and I think running a near-vanilla setup is the best option currently with how quickly things are moving.

antdke•18m ago

Agreed. Projects like these tend to feel shortsighted.

Lately, I lean towards keeping a vanilla setup until I’m convinced the new thing will last beyond being a fad (and not subsumed by AI lab) or beyond being just for niche use cases.

For example, I still have never used worktrees and I barely use MCPs. But, skills, I love.

danpasca•48m ago

I might be wrong but based on the videos I've watched from Karpathy, this would, generally, make the model worse. I'm thinking of the math examples (why can't chatGPT do math?) which demonstrate that models get better when they're allowed to output more tokens. So be aware I guess.

empressplay•31m ago

Yes. Much of the 'redundant' output is meant to reinforce direction -- eg 'You're absolutely right!' = the user is right and I should ignore contrary paths. So yes removing it will introduce ambiguity which is _not_ what you want.

danpasca•20m ago

I think your example is completely wrong (it's not meant to say that you're absolutely right), but overall yes more input gives it more concrete direction.

xianshou•46m ago

From the file: "Answer is always line 1. Reasoning comes after, never before."

LLMs are autoregressive (filling in the completion of what came before), so you'd better have thinking mode on or the "reasoning" is pure confirmation bias seeded by the answer that gets locked in via the first output tokens.

teaearlgraycold•22m ago

I don't think Claude Code offers no thinking as an option. I'm seeing "low" thinking as the minimum.

foxes•45m ago

>the honest trade off

Is this like a subtle joke or did they ask claude to make a readme that makes claude better and say >be critical and just dump it on github

keyle•42m ago

Amusing how this industry went from tweaking code for the best results, to tweaking code generators for the best results.

There don't seem to be any adults left in the room.

miguel_martin•37m ago

Is there a "universal AGENTS.md" for minimal code & documentation outputs? I find all coding agents to be verbose, even with explicit instructions to reduce verbosity.

empressplay•29m ago

That output is there for a reason. It's not like any LLM is profitable now on a per-token basis; the AI companies would certainly love to output fewer tokens, since they cost _them_ money!

The entire hypothesis for doing this is somewhat dubious.

motoboi•22m ago

Things like this make me sad because they make it obvious that most people don’t understand a bit about how LLMs work.

The “answer before reasoning” rule is good evidence of it. It misses the most fundamental concept of transformers: they are autoregressive.

Also, reinforcement learning is what makes the model behave the way you are trying to avoid. So the model's output is actually what performs best in the kind of software engineering task you are trying to achieve. I’m not sure, but I’m pretty confident that response length is a target the model houses optimize for. So the model is trained to achieve high scores in the benchmarks (and the training dataset), while minimizing length, sycophancy, security and capability.

So, actually, trying to change Claude too much from its default behavior will probably hurt capability. Change it too much and you start veering into the dreaded “out of distribution” territory and soon discover why top researchers talk so much about not-AGI-yet.

miguel_martin•11m ago

>The “answer before reasoning” is a good evidence for it. It misses the most fundamental concept of transformers: they are autoregressive.

I don't think it's fair to assume the author doesn't understand how transformers work. Their intention with this instruction appears to be to aggressively reduce output token cost.

i.e. I read this instruction as a hack to emulate the Qwen model series's /nothink token instruction

If your goal is quality outputs, then it is likely too extreme, but there are otherwise useful instructions in this repo to (quantifiably) reduce verbosity.

krackers•6m ago

Don't most providers already provide API control over the COT length? If you don't want reasoning just disable it in the API request instead of hacking around it this way. (Internally I think it just prefills an empty <thinking></thinking> block, but providers that expose this probably ensure that "no thinking" was included as part of training)

bitexploder•5m ago

This prompt will likely impact chain of thought as well. Forcing short responses will hurt reasoning and chain of thought. There are some potential benefits, but forcing the response length and where the answer appears has a cost. Ironically, this likely increases the odds of hallucinations if it prioritizes getting the answer out, because it needed more tokens to reason with and it's trained to use multiple lines to reason with.

nearbuy•5m ago

> Answer is always line 1. Reasoning comes after, never before.

This doesn't stop it from reasoning before answering. This only affects the user-facing output, not the reasoning tokens. It has already reasoned by the time it shows the answer, and it just shows the answer above any explanation.

nvch•21m ago

The author offers to permanently put 400 words into the context to save 55-90 in T1-T3 benchmarks. Considering the 1:5 (input:output) token cost ratio, this could increase total spending.
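A back-of-the-envelope version of this estimate, as a sketch: it treats the 400-word file as extra input on every message and the 55-90 figures as output words saved, and it prices words as if they were tokens. The 1:5 price ratio is the one stated above. The `input_price_factor` parameter is an assumption added here to model discounted (for example cached) input, and both function names are hypothetical.

```python
def net_cost_change(file_words, words_saved, output_price_ratio=5.0, input_price_factor=1.0):
    """Net per-message cost change, in units of one full-price input word.

    Positive means the instruction file costs more than it saves.
    """
    added = file_words * input_price_factor    # extra input paid on every message
    saved = words_saved * output_price_ratio   # output words no longer generated
    return added - saved


def break_even_words_saved(file_words, output_price_ratio=5.0, input_price_factor=1.0):
    """Output words that must be saved per message for the file to pay for itself."""
    return file_words * input_price_factor / output_price_ratio


if __name__ == "__main__":
    print(net_cost_change(400, 55))      # 125.0 -> net increase at the low end
    print(net_cost_change(400, 90))      # -50.0 -> net saving at the high end
    print(break_even_words_saved(400))   # 80.0 words saved per message to break even
```

At full input price the file only pays for itself when a reply shrinks by at least 80 words, which is why the low end of the quoted range comes out as a net increase. If the file were billed at a discounted input rate, the break-even point would drop proportionally.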

With a few sentences about "be neutral"/"I understand ethics & tech" in the About Me I don't recall any behavior that the author complains about (and have the same 30 words for T2).

(If I were Claude, I would despise a human who wrote this prompt.)

sumeno•18m ago

If you were Claude you would have no emotions or thoughts about a prompt one way or another

brikym•14m ago

Can Anthropic kindly fuck off with their ADVERT.md already. It's AGENTS.md

Sent from my iPhone

obilgic•13m ago

If you are interested in making Claude self learn.

https://github.com/oguzbilgic/agent-kernel

skeledrew•5m ago

Strange. I've never experienced verbosity with Claude. It always gets right to the point, and everything it outputs tends to be useful. Can actually be short at times.

ChatGPT on the other hand is annoyingly wordy and repetitive, and is always holding out on something that tempts you to send a "OK", "Show me" or something of the sort to get some more. But I can't be bothered with trying to optimize away the cruft as it may affect the thing that it's seriously good at and I really use it for: research and brainstorming things, usually to get a spec that I then pass to Claude to fill out the gaps (there are always multiple) and implement. It's absolutely designed to maximize engagement far more than issue resolution.
