How to benchmark persistent repo memory for coding agents

https://autoloops.ai/greplica/blog/benchmarking-greplica/

2•kushalpatil07•1h ago

Comments

kushalpatil07•1h ago

Greplica is a context layer for your coding agents. It stores information about your current architecture, decisions, and nuances, drawn from your code and sessions, and gives it to your agent before it starts exploring. This is the kind of information you would explain to a dev about how a particular thing works. The idea is that if we can maintain this information, the agent will not need to grep through 100 files to rediscover the same thing, which saves tokens and time, and having the prior decision history should improve the coding itself.

The benchmark is built from the SWE-Chat dataset, which consists of real coding sessions by users on open source projects.

The benchmark setup is temporal:

take prior coding-agent sessions from a repo

build memory only from those prior sessions

hold out a later session from the same repo

run the same planning task at the same pre-task commit

compare baseline vs memory-assisted agent

The held-out session is not used while building memory.
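The temporal split above can be sketched roughly like this. It is a minimal illustration, not the actual benchmark harness: the `Session` record, its field names, and `temporal_split` are all hypothetical and do not reflect the SWE-Chat schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Session:
    # Hypothetical record; field names are illustrative only.
    session_id: str
    repo: str
    started_at: datetime
    pre_task_commit: str


def temporal_split(sessions, repo):
    """Return (prior_sessions, held_out) for one repo.

    The latest session is held out. Memory may only be built from
    sessions that started strictly before it.
    """
    in_repo = sorted(
        (s for s in sessions if s.repo == repo),
        key=lambda s: s.started_at,
    )
    if len(in_repo) < 2:
        raise ValueError("need at least one prior session and one held-out session")
    held_out = in_repo[-1]
    # Strict inequality guards against leaking a session with the same timestamp.
    prior = [s for s in in_repo[:-1] if s.started_at < held_out.started_at]
    return prior, held_out
```

Both the baseline and the memory-assisted run would then start from `held_out.pre_task_commit`, so the only difference between them is the memory built from `prior`.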

The agent only gets access to repo memory created from earlier work: architectural facts, subsystem behavior, gotchas, failed attempts, implementation notes, constraints, etc. Each memory item is tied back to evidence from files/commits/sessions.
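One way to picture a memory item tied to evidence is the sketch below. The class names, fields, and category strings are assumptions made for illustration and are not Greplica's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    # Hypothetical: kind is one of "file", "commit", or "session".
    kind: str
    ref: str


@dataclass
class MemoryItem:
    # Hypothetical categories, e.g. "architectural_fact", "gotcha", "failed_attempt".
    category: str
    claim: str
    evidence: list = field(default_factory=list)

    def is_grounded(self):
        """A claim counts as grounded if at least one evidence link backs it."""
        return len(self.evidence) > 0


def grounded_only(items):
    """Keep only the items that can be traced back to some evidence."""
    return [item for item in items if item.is_grounded()]
```

Filtering on `is_grounded()` is just one possible policy; the point is that each claim carries a pointer the agent can follow to verify it.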

On the selected 10 high-context planning tasks, Greplica reduced:

cost by 43%

tokens by 49%

tool calls by 36%

elapsed planning time by 26%
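For reference, here is how a percentage reduction like the ones above can be computed from paired runs. The post does not say whether the numbers are aggregated over totals or averaged per task, so both variants are shown; the function names and sample values are hypothetical.

```python
def percent_reduction(baseline, with_memory):
    """Percentage saved relative to the baseline measurement."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - with_memory) / baseline


def reduction_on_totals(pairs):
    """Aggregate reduction: sum each arm across tasks, then compare totals."""
    total_baseline = sum(b for b, _ in pairs)
    total_memory = sum(m for _, m in pairs)
    return percent_reduction(total_baseline, total_memory)


def mean_per_task_reduction(pairs):
    """Alternative: compute the reduction per task, then average."""
    per_task = [percent_reduction(b, m) for b, m in pairs]
    return sum(per_task) / len(per_task)
```

With made-up pairs `[(100, 50), (300, 200)]`, the reduction on totals is 37.5% while the per-task mean is about 41.7%, so it matters which aggregation a benchmark reports.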

We tried to benchmark coding tasks as well, but that is difficult because coding trajectories can vary a lot: one agent might run tests every time it codes, while another may not.

There were other interesting results as well. They are not polished yet, but we would love to share them.

Variance:

Running the same task multiple times without memory can produce very different planning traces.

Sometimes the agent finds the right subsystem quickly.

Sometimes it burns a lot of tokens exploring irrelevant files, gets anchored on the wrong abstraction, or only discovers the important context late in the run.

That makes single-run agent benchmarks pretty noisy.

Memory seems to reduce this variance because the early part of planning changes. The agent is no longer doing broad repo archaeology from zero. It starts with a smaller set of relevant claims, then uses repo exploration to verify and fill gaps.
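A simple way to quantify this run-to-run noise is the coefficient of variation of a metric (tokens, tool calls, time) over repeated runs of the same task. This is a sketch of one possible measure, not necessarily the one used in the benchmark.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean.

    Unitless, so token counts and elapsed times can be compared
    on the same scale.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    m = mean(samples)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return stdev(samples) / m
```

Computing this separately for baseline runs and memory-assisted runs of the same task gives a direct check on whether memory narrows the spread.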

Greplica vs docs-folder

The second thing we are benchmarking now is Greplica vs a docs-folder baseline.

The obvious baseline is:

“Why not just write all prior session memory into markdown files and let the agent read them?”

At small docs sizes, this actually works quite well.

Quality is similar. Token usage is also similar. There are only a few files, so the agent can cheaply scan them.

But as more sessions are ingested, the docs folder degrades badly. We saw this in cases where the number of ingested sessions went from 3 to 11.

Greplica improves because there is more prior engineering context to retrieve from, and an optimized retrieval pipeline surfaces the relevant parts.

The docs folder gets worse on token usage because it slowly becomes another codebase. The agent now has to search the docs, rank relevance, detect stale notes, resolve conflicts, and decide which facts to consider.

So the bottleneck moves from storage to retrieval.
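To make the storage vs retrieval point concrete, here is a toy comparison. The word-overlap ranking is a deliberately naive stand-in and is not Greplica's retrieval pipeline; all names are hypothetical.

```python
def tokenize(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def scan_cost(notes):
    """Words the agent reads if it scans every note (docs-folder style)."""
    return sum(len(note.split()) for note in notes)


def top_k(notes, query, k=3):
    """Return the k notes sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        notes,
        key=lambda note: len(query_words & tokenize(note)),
        reverse=True,
    )
    return ranked[:k]
```

`scan_cost(notes)` grows with every ingested session, while `scan_cost(top_k(notes, query, k))` is bounded by the k largest notes. What is left is ranking quality, which is the hard part.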

Repo: https://github.com/Autoloops/greplica
