With ARK, I'm trying to treat context as a constrained working set rather than something static. It starts minimal and only expands when there's signal (failures, ambiguity, etc.), so the model isn't reasoning over stale context.
Curious about your approach: are you leaning more toward restructuring how context is stored/retrieved (external memory, RAG, etc.), or toward dynamically controlling what actually enters the prompt at each step?
Feels like a fundamental bottleneck for production agent systems, so would love to compare how you're thinking about the latency vs accuracy tradeoff.
So before each turn is sent to the LLM, we (potentially) run a local process that assembles a bespoke context containing only what's required for that specific turn.
If a tool call isn't going to be needed on a given round, that tool doesn't go into the system prompt for that round.
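To illustrate the idea, here's a minimal sketch of per-turn context assembly. Everything in it (`Tool`, `Turn`, the keyword-based relevance check) is a hypothetical stand-in, not `ail`'s actual local process, which would presumably use a much smarter relevance signal:

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str
    keywords: frozenset  # crude relevance signal; a stand-in for the real check


@dataclass
class Turn:
    user_message: str


def assemble_context(turn: Turn, tools: list, base_prompt: str) -> str:
    """Build a per-turn system prompt containing only the tools that
    look relevant to this turn, so unused tools never enter the context."""
    words = set(turn.user_message.lower().split())
    relevant = [t for t in tools if t.keywords & words]
    if not relevant:
        return base_prompt
    tool_block = "\n".join(f"- {t.name}: {t.description}" for t in relevant)
    return f"{base_prompt}\n\nAvailable tools:\n{tool_block}"


tools = [
    Tool("run_tests", "Run the test suite", frozenset({"test", "tests", "failing"})),
    Tool("web_search", "Search the web", frozenset({"search", "lookup"})),
]
prompt = assemble_context(
    Turn("why are the tests failing?"), tools, "You are a coding agent."
)
# run_tests is included this round; web_search never enters the prompt.
```

The point of the sketch is just the shape of the mechanism: the prompt is rebuilt from scratch each turn, so anything not selected simply isn't there to rot.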
I'm still formalizing the spec at the moment, and I figure I'm six months to a year out from a full, human-ready UI.
This is the foundational paper I'm basing the tool on: https://github.com/AlexChesser/ail/blob/main/docs/blog/the-y... while the spec starts here: https://github.com/AlexChesser/ail/blob/main/spec/core/s01-p...
Essentially, I'm trying to build an artificial neocortex and frontal lobe: a complete Executive Function layer that operates on top of our agents, like Claude Code (or whatever else).
I'm basing the roadmap on roughly 100 years of cognitive science. We've legitimately had names for all of these failure modes (in humans) since the 1960s, and observations of what we're now witnessing in agents going back to 1848.
We already have the roadmap, from psychology.
> Feels like a fundamental bottleneck for production agent systems, so would love to compare how you're thinking about the latency vs accuracy tradeoff.
I'm really not focusing on latency right now. My short-term goal is to prove the thesis that `ail` can improve same-model performance on SWE-Bench Pro vs. the published results.
Can I run SWE-Bench Pro with GLM-4.6 and get a score better than its published `68.20`? https://www.swebench.com/
The argument is that latency just isn't the part we should worry about right now. If we're reducing the time to code something from ~6 weeks to 1 hour, does it really matter that we add another 30 minutes of tool calls if we get it 100% right vs. 80% right?
Make it work -> Make it right -> make it fast.
I'm still on the first one, tbh :rofl-emoji:
AlexC04•49m ago
Keep it up! You're on the right track.
Hong, K., & Chroma Research Team. (2025). Context rot: How increasing input tokens impacts LLM performance. Chroma Research. https://research.trychroma.com/context-rot
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638