Don't trust large context windows

https://garrit.xyz/posts/2026-05-06-dont-trust-large-context-windows
56•computersuck•2h ago

Comments

da-x•2h ago
Perhaps compaction could be done over multiple requests on smaller, overlapping chunks, to avoid the 'dumb zone' and yield a better result.
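A minimal sketch of that idea, assuming the conversation is a list of messages and that `summarize` is a hypothetical callable that sends one small request per chunk. The chunk size and overlap are illustrative values, not recommendations.

```python
def overlapping_chunks(items, size, overlap):
    """Split items into windows of `size` that share `overlap` items with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reaches the end
    return chunks


def compact(messages, summarize, size=20, overlap=4):
    """Summarize each window with its own request, then join the partial summaries.

    `summarize` is a placeholder for whatever model call is used (an assumption).
    """
    return "\n".join(summarize(chunk) for chunk in overlapping_chunks(messages, size, overlap))
```

Each request then sees only `size` messages, so no single call has to work deep into a long context.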
mcapodici•1h ago
I /clear all the time out of habit. I want to be able to get the thing done with minimal context. It also means you can do it again slightly differently if needed, since you know the seed conditions for the task.
afc•1h ago
The approach we're taking to deal with this very real context rot is using a bunch of related techniques which we call transposing the agent loop: https://alejo.ch/3jt

In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each loop advances the state in a small step towards the final goal.
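A rough sketch of the general shape described here, not the implementation from the linked post. The `State` fields, `build_prompt`, and the `run_short_loop` callable are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Structured data that survives between loops (hypothetical shape)."""
    goal: str
    done: list = field(default_factory=list)
    todo: list = field(default_factory=list)


def build_prompt(state):
    """Generate a fresh, short prompt from the state instead of replaying history."""
    finished = ", ".join(state.done) or "nothing yet"
    return f"Goal: {state.goal}\nAlready done: {finished}\nNext step: {state.todo[0]}"


def run(state, run_short_loop):
    """Advance the state one small step per loop; each loop starts with a new context."""
    while state.todo:
        run_short_loop(build_prompt(state))  # placeholder for one short agent loop
        state.done.append(state.todo.pop(0))
    return state
```

The point of the sketch is that the prompt is rebuilt from data each time, so context size depends on the state, not on how long the work has been running.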

kristianc•1h ago
I'm getting a lot of mileage out of basically acting like the AI's Product Manager, and insisting that it writes up short PRDs for every feature we propose to build. That gives it a reference over time of everything that has been built, but also makes it less liable to drift with each one. Each one gets its own conversation. For me this is a happy medium between stopping it going off the rails and making sure it can reference past decisions when it needs to. The one thing I dislike about Pocock's method (not so much using PRDs as having an in-depth discussion first to get alignment) is that it wastes a lot of the best window on that initial back and forth.
nopurpose•1h ago
Is it ad hoc, or do you use a more structured approach like openspec? I also tend to work on a plan first, but it stays as an in-session todo, which is hard to reference later.
kristianc•53m ago
It's ad hoc / my own framework, just found something which works for me. The exact structure is

- Work Mode - HITL/AFK

- Problem Statement

- Who It Affects - Primary / Secondary User

- User Stories

- Business Case

- Why Now

- Success Criteria

- In Scope/Out of Scope (Out of Scope v. important)

- Thinnest Slice (This I've found super valuable, means you max out the amount of 'product' for your buck and avoid diminishing marginal returns or overbuilding. Often I will build this)

- Eigenfeature - What is the larger feature we _could_ build (but probably won't) which would solve for this use case and other stuff I might not have thought of

- Technical Notes

- Deps

- Schema Changes

- Risks

- Final Recommendation [go / no go, including on scope]

There's a note in my Claude / Agents MD which says no net new feature gets introduced without this and I get it to move through a pipeline of folders (active, approved, shipped, proposed etc). All runs in a system of MD files and have even created a little MD Kanban from the metadata!
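As an illustration only, a board like that could be derived from where each PRD file sits. This sketch assumes the stage is the name of the parent folder and that the column order is as listed in `STAGES`; both are assumptions, and the comment above mentions metadata rather than folder names.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed column order; the folder names are the ones mentioned above.
STAGES = ["proposed", "approved", "active", "shipped"]


def kanban(paths):
    """Render a Markdown board with one heading per stage and one bullet per PRD."""
    columns = defaultdict(list)
    for path in map(PurePosixPath, paths):
        columns[path.parent.name].append(path.stem)  # stage = parent folder name
    lines = []
    for stage in STAGES:
        lines.append(f"## {stage}")
        lines.extend(f"- {name}" for name in sorted(columns[stage]))
    return "\n".join(lines)
```

For example, `kanban(["prds/active/search.md", "prds/shipped/login.md"])` lists `search` under `active` and `login` under `shipped`.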

da_grift_shift•18m ago
Is there back-and-forth? How long do these get? Can you share an example?
kelnos•1h ago
This has not been my experience with Opus since Anthropic released the 1M token context window for use under the subscription plans. I routinely push past 500k tokens, even sometimes up to around 800k tokens, and don't see this problem. I've seen it to some extent when getting truly near the limit, up around and above 900k tokens, though what I see isn't as severe as the author seems to see.

(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)

I'm not saying that no one ever has this experience, but it's odd to me that some people see it so often that it warrants giving it a name.

asd88•55m ago
I’ve had similar experiences with Fable. 70%+ context used out of 1M, still sharp and no memory issues.
fullstackchris•49m ago
That's another problem with this post: the author mentions Claude, but not explicitly which models...

100k tokens "by lunch" is also not my finding, the newer models will hit that already right in the initial exploratory phase

arcanemachiner•48m ago
Really depends on the project.
stavros•21m ago
I found "by lunch" odd too, but considering that Claude wrote the article, it's not going to know specifics.
mg•1h ago
Considerations about what goes on in agents internally will probably not be part of software development for long.

Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.

To have an ongoing sense of which agents perform best, I keep a log and calculate an Elo score of which agents meet my demands best. This score is important to me, not so much how the agent achieves it.
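For reference, a minimal sketch of the standard Elo update applied to a pairwise comparison of two agents. The K-factor of 32 and the starting rating of 1500 are conventional choices assumed here, not values from the comment.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, score_a, k=32):
    """Return new ratings after one comparison.

    score_a is 1 if agent A's result was preferred, 0 if B's was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


# Two agents start level; A's output wins the comparison.
print(update(1500, 1500, 1))  # (1516.0, 1484.0)
```

The ratings move by equal and opposite amounts, so the log only needs to record which of two outputs was preferred for each feature request.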

hypfer•56m ago
This is an absolutely crazy wasteful thing to do considering the actual cost of all that inference and nothing to be proud of.
cyanydeez•36m ago
come on now, we can't just not escape the permanent underclass by using our brains, we've also got to use up all the resources while doing it.
mg•24m ago
It is the other way round.

In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used, as the model now gets not only the original code and the feature request but also the updated code plus the change request as input tokens.

Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.
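The arithmetic behind that comparison can be sketched as follows. It counts raw input tokens only and deliberately ignores prompt caching and any price difference between input and output tokens; the token figures are made-up illustrative numbers.

```python
def interactive_input_tokens(context, request, output, change):
    """Follow-up in the same session: the second call re-sends everything plus the first answer."""
    first = context + request
    second = context + request + output + change
    return first + second


def restart_input_tokens(context, request, change):
    """Fresh session: the second call sends the original input plus the amended request."""
    first = context + request
    second = context + request + change
    return first + second


# Illustrative sizes: 10k tokens of code, a 200-token request,
# a 3k-token first answer and a 20-token change request.
print(interactive_input_tokens(10_000, 200, 3_000, 20))  # 23420
print(restart_input_tokens(10_000, 200, 20))             # 20420
```

Under these assumptions the gap between the two approaches is exactly the size of the first answer that the interactive session carries forward.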

jgilias•10m ago
The cost is far from linear though. Because of prompt caching and the fact that generally output tokens are a lot more expensive than input tokens.
PeterStuer•59m ago
I've had no problem with Claude Code Opus 4.8 effort max using 20% token context (200k) on software development tasks (all stages). I always load core source files and the ones we are working on up front. Around 20%, I make it autoprepare for a new session and clear.

Admittedly I have been doing this as a precaution, based on anecdotal evidence, not because I have had bad experiences with longer-context deterioration myself.

In the brief time I had access to Fable 5, it went on long running tasks (>45 mins) into the 30-40% zone without apparent context coherence problems.

mock-possum•53m ago
Hasn’t been my experience at all - 1M window is a very clear upgrade working with Claude Code.
mightyham•51m ago
Even taking the author's criticisms of large context windows for granted, which in my experience are exaggerated, large windows are still a huge UX improvement over short ones. That reason alone is enough for me to support them.
petesergeant•38m ago
Is there any chance that this is because the training corpus largely consists of documents shorter than the advertised context windows?
walthamstow•34m ago
There's an env var you can set in Claude Code to bring the autocompact threshold down, effectively setting your own max context window. I have it at 400k.
jackxlau•23m ago
In my own testing I have seen peak performance happen usually within 15-20% of the intended context limit, albeit there are a few optimizations depending on the task quality.
bob1029•15m ago
I've been able to avoid context size issues by applying one simple constraint to my agent loop. What I do is prevent all tool calling in the user's top-level conversation thread. Anything that needs to call a tool must happen in a recursive invocation of the agent, which returns its results to the caller.

I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.

There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.

Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.
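A toy sketch of that constraint, with string lengths standing in for token counts. The function names, the `tools` list of callables, and the `"done: ..."` result format are all hypothetical; a real agent would call a model where this code just concatenates strings.

```python
def sub_agent(task, tools, transcript_sizes):
    """Run one task in a fresh context; tool output stays here and only a short result returns."""
    context = [task]
    for tool in tools:
        context.append(tool(task))  # bulky tool output lands in the child context
    transcript_sizes.append(sum(len(c) for c in context))  # size of the discarded child context
    return f"done: {task}"  # only this short line reaches the caller


def root_turn(root_context, user_message, tools, transcript_sizes):
    """Top-level thread: never calls tools itself, only delegates (recursion depth 1)."""
    root_context.append(user_message)
    result = sub_agent(user_message, tools, transcript_sizes)
    root_context.append(result)
    return result
```

However much the tools return, the root thread grows only by the user message and the short result, which is the effect described above.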

Febriss33•4m ago
I let the main loop spawn sub-terminals via tmux to prevent large contexts. It's great for dividing tasks into small pieces and consolidating them step by step.
arcanemachiner•48m ago
Opus 4.6 was on drugs past 200k, I skipped 4.7, 4.8 did good up to ~350k, and Fable did great beyond 400k, in my limited testing. The quality does appear to be trending upwards.
Bolwin•39m ago
I see this said often and find it insane given how many times I find opus models making basic recall mistakes at <100k tokens.

Personally I consider < 60k to be the smart zone for opus. This is worse for opus 4.7 and 4.8 cause of the more granular tokenizer

eterm•35m ago
60k is tiny, if it's making recall mistakes that early then you might have some false memories or incorrect instructions in your CLAUDE.md.

60k isn't much bigger than the system prompt.

danielbln•25m ago
Yeah 60k is ludicrous, I've barely seeded the context at that point and I don't see context related degradation until well into the 600-700k.
da_grift_shift•22m ago
>you might have some false memories or incorrect instructions in your CLAUDE.md

    "YOU'RE HOLDING IT WRONG!"
wg0•28m ago
Not specific to Opus but yes it would make mistakes. I usually try to keep context window under 10%
properbrew•27m ago
I hate to do the "you're holding it wrong" trope, but I think you might have something misconfigured somewhere unless you missed a 0, because just past 60k tokens is such a small context window to be seeing issues in.

Do you have any old documentation that it's picking up and referencing? If you set all claude settings back to default do you see the same issue?

cyanydeez•38m ago
As the gamblers say at the poker table: If you can't figure out who the mark is when you sit down...
csomar•22m ago
I have a custom build command for a rust project (yarn build:lib) and my experience is 120k for GLM and roughly 200-300k for Opus. After that, they default to cargo build.
trapexit•16m ago
My projects have specific build/verify steps as well, and after a certain point Claude forgets to run them. I’m going to try a “No brown M&Ms” hook to halt Claude if it tries to run the default command instead of the instructed commands from CLAUDE.md. Perhaps this will be a good signal that a compacted or fresh session is needed at that point to avoid mistakes.
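The check at the core of such a hook might look like the sketch below. The command names are borrowed from the parent comment (`yarn build:lib` as the instructed command, `cargo build` as the default); how the check is wired into Claude Code's hook mechanism is left out, and `check_command` is a hypothetical helper, not part of any tool.

```python
import shlex

# Assumed project rule, taken from the parent comment's example.
REQUIRED = "yarn build:lib"
FORBIDDEN_PREFIXES = [["cargo", "build"]]


def check_command(command):
    """Return (allowed, message) for a shell command the agent wants to run."""
    words = shlex.split(command)
    for prefix in FORBIDDEN_PREFIXES:
        if words[:len(prefix)] == prefix:
            return False, (
                f"Use `{REQUIRED}` instead of `{' '.join(prefix)}`; "
                "the project instructions may have dropped out of context."
            )
    return True, ""
```

A rejected command is then the signal that it may be time for a compacted or fresh session.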
Chirono•9m ago
That’s usually not true due to caching. It may be true if you leave a large gap in between, but if you send “make it red” right after, then it’s purely incremental
ryan_glass•5m ago
"Make the button red" probably doesn't need an LLM at all.
redox99•22m ago
Probably like 1% of the energy an average person spends on driving.
Raphael_Amiard•15m ago
Average American is what you mean
loehnsberg•11m ago
Unless we do our own benchmarks, we have to take all the marketing fluff from the frontier labs at face value, and all public benchmarks degrade eventually as labs optimize towards them. OP’s approach is wasteful because it is brute force, but the post says that an Elo score is kept, so this is also an experiment, and I don’t see what’s wrong with that. You learn which model performs well in which settings, which may save resources later. It’s also wasteful to keep working with the wrong model/harness/tools for too long.

Phoenix LiveView 1.2 Released

https://phoenixframework.org/blog/phoenix-liveview-1-2-released
77•ksec•3h ago•12 comments

Honda Civics and the Evil Valet

https://juniperspring.org/posts/honda-evil-valet/
254•librick•7h ago•42 comments

Free SQL→ER diagram tool, runs in the browser, nothing uploaded

https://sqltoerdiagram.com/
82•robhati•4h ago•17 comments

GLM 5.2 Is Out

https://twitter.com/jietang/status/2065784751345287314
552•aloknnikhil•16h ago•294 comments

Noise infusion banned from statistical products published by Census Bureau

https://desfontain.es/blog/banning-noise.html
814•nl•18h ago•508 comments

Consciousness likely not unique to earthlings, paper says

https://news.ucr.edu/articles/2026/06/10/consciousness-likely-not-unique-earthlings-paper-says
29•giuliomagnifico•3h ago•29 comments

Tribblix: The retro Illumos distribution

http://tribblix.org/
31•naturalmovement•3h ago•7 comments

Every Frame Perfect

https://tonsky.me/blog/every-frame-perfect/
686•ravenical•21h ago•226 comments

Beagle: Git, URIs and all the dirty words

https://replicated.wiki/blog/uris.html
8•gritzko•2d ago•0 comments

Pac-Man, but you're the ghost

https://garrit.xyz/posts/2026-06-13-pac-man-but-you-re-the-ghost
79•mindracer•4h ago•36 comments

Building a serial and VGA "everything console"

http://oldvcr.blogspot.com/2026/06/building-serial-and-vga-everything.html
27•classichasclass•6h ago•1 comments

FreeOberon – Open-Source, Cross-Platform, Free Pascal/Turbo Pascal-Like Language

https://github.com/kekcleader/FreeOberon
79•peter_d_sherman•2d ago•29 comments

Treating pancreatic tumours may have revealed cancer's master switch

https://economist.com/science-and-technology/2026/06/12/treating-pancreatic-tumours-may-have-reve...
354•andsoitis•19h ago•126 comments

Python 3.14 garbage collection rigamarole

https://theconsensus.dev/p/2026/06/06/python-3-14-garbage-collection-rigamarole.html
49•eatonphil•1d ago•29 comments

Pyodide 314.0: Python packages can now publish WebAssembly wheels to PyPI

https://blog.pyodide.org/posts/314-release/
121•agriyakhetarpal•4d ago•27 comments

LaserWriter seeds

https://inventingthefuture.ghost.io/laserwriter-seeds/
13•frizlab•3d ago•0 comments

Weave: Merging based on language structure and not lines

https://ataraxy-labs.github.io/weave/
37•rohanat•5h ago•20 comments

Making Claude a Chemist

https://www.anthropic.com/research/making-claude-a-chemist
34•gmays•5h ago•23 comments

Codex for open source

https://openai.com/form/codex-for-oss/
224•EvgeniyZh•2d ago•87 comments

(Re//Verse 2026) Taxonomy and Deobfuscation of a Real World Binary Obfuscator [pdf]

https://github.com/AnalogCyberNuke/RE-Verse-2026-Slides/blob/main/Reverse26.pdf
16•not_a9•2d ago•1 comments

GameBoy Workboy

https://tcrf.net/Workboy
185•tosh•14h ago•65 comments

Amazon CEO's talks with U.S. officials triggered crackdown on Anthropic models

https://www.wsj.com/tech/ai/amazon-ceos-talks-with-u-s-officials-triggered-crackdown-on-anthropic...
671•ls612•15h ago•495 comments

A low-carbon computing platform from your retired phones

https://research.google/blog/a-low-carbon-computing-platform-from-your-retired-phones/
285•vikas-sharma•23h ago•152 comments

Running DOS on Behringers DDX3216 with a DIY x86-Bios from Scratch

https://chrisdevblog.com/2026/06/08/running-dos-on-behringers-ddx3216-using-a-diy-x86-bios/
91•rasz•14h ago•22 comments

ReactOS (FOSS "Windows") achieves 3D-accelerated Half-Life on real hardware

https://www.phoronix.com/news/ReactOS-Running-Half-Life
209•jeditobe•9h ago•30 comments

Appreciating Exif

https://brentfitzgerald.com/posts/appreciating-exif/
156•burnto•4d ago•33 comments

RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

https://imil.net/blog/posts/2026/rtx-5080-+-rtx-3090-setup-80+-tok-s-on-qwen-3.6-27b-q8/
243•iMil•22h ago•84 comments

Software Architecture Guide (2019)

https://martinfowler.com/architecture/
54•laxmena•4h ago•19 comments

Apt Encounters of the Third Kind

https://igor-blue.github.io/2021/03/24/apt1.html
24•ogurechny•6h ago•5 comments