With ARK, I'm trying to treat context as a constrained working set rather than something static. It starts minimal and only expands when there's signal (failures, ambiguity, etc.), so the model isn't reasoning over stale context.
Curious about your approach: are you leaning more toward restructuring how context is stored/retrieved (external memory, RAG, etc.), or toward dynamically controlling what actually enters the prompt at each step?
Feels like a fundamental bottleneck for production agent systems, so would love to compare how you're thinking about the latency vs accuracy tradeoff.
So before each turn is sent to the LLM, we (potentially) run a local process that assembles a bespoke context containing only what's required for that specific turn.
If a tool call isn't going to be needed on a given round, that tool doesn't go into the system prompt for that round.
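To illustrate the idea, here's a minimal sketch of per-turn context assembly. Everything in it (`Tool`, `Turn`, the keyword-based relevance check) is a hypothetical stand-in, not `ail`'s actual local process, which would presumably use a much smarter relevance signal:

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str
    keywords: frozenset  # crude relevance signal; a stand-in for the real check


@dataclass
class Turn:
    user_message: str


def assemble_context(turn: Turn, tools: list, base_prompt: str) -> str:
    """Build a per-turn system prompt containing only the tools that
    look relevant to this turn, so unused tools never enter the context."""
    words = set(turn.user_message.lower().split())
    relevant = [t for t in tools if t.keywords & words]
    if not relevant:
        return base_prompt
    tool_block = "\n".join(f"- {t.name}: {t.description}" for t in relevant)
    return f"{base_prompt}\n\nAvailable tools:\n{tool_block}"


tools = [
    Tool("run_tests", "Run the test suite", frozenset({"test", "tests", "failing"})),
    Tool("web_search", "Search the web", frozenset({"search", "lookup"})),
]
prompt = assemble_context(
    Turn("why are the tests failing?"), tools, "You are a coding agent."
)
# run_tests is included this round; web_search never enters the prompt.
```

The point of the sketch is just the shape of the mechanism: the prompt is rebuilt from scratch each turn, so anything not selected simply isn't there to rot.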
I'm still formalizing the spec at the moment, and I figure I'm six months to a year out from a full, human-ready UI.
This is the foundational paper I'm basing the tool on: https://github.com/AlexChesser/ail/blob/main/docs/blog/the-y... while the spec starts here: https://github.com/AlexChesser/ail/blob/main/spec/core/s01-p...
Essentially, I'm trying to build an artificial neocortex and frontal lobe: a complete Executive Function layer that operates on top of our agents, like Claude Code (or whatever else).
I'm basing the roadmap on roughly 100 years of cognitive science. We've legitimately had names for all of these failure modes (in humans) since the 1960s, and observations of what we're now witnessing in agents going back to 1848.
We already have the roadmap, from psychology.
> Feels like a fundamental bottleneck for production agent systems, so would love to compare how you're thinking about the latency vs accuracy tradeoff.
I'm really not focusing on latency right now. My short-term goal is to prove the thesis that `ail` can improve same-model performance on SWE-Bench Pro vs. the published results.
Can I run SWE-Bench Pro with GLM-4.6 and get a score better than its published `68.20`? https://www.swebench.com/
The argument is that latency just isn't the part we should worry about right now. If we're reducing the time to code something from ~6 weeks to 1 hour, does it really matter that we add another 30 minutes of tool calls if we get it 100% right vs. 80% right?
Make it work -> Make it right -> make it fast.
I'm still on the first one, tbh :rofl-emoji:
AlexC04•49m ago
Keep it up! You're on the right track.
Hong, K., & Chroma Research Team. (2025). Context rot: How increasing input tokens impacts LLM performance. Chroma Research. https://research.trychroma.com/context-rot
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638