How to benchmark persistent repo memory for coding agents

https://autoloops.ai/greplica/blog/benchmarking-greplica/
2•kushalpatil07•1h ago

Comments

kushalpatil07•1h ago
Greplica is a context layer for your coding agents. It stores information about your current architecture, decisions, nuances, etc. from your code and sessions, and gives it to your agent before it starts exploring. This is the kind of information you would give a dev when explaining how a particular thing works. The idea is that if we can maintain this information, the agent will not need to grep through 100 files to rediscover the same thing, which saves tokens and time, and it can use prior decision history to improve the coding itself.

The benchmark is built from the SWE-Chat dataset, which consists of real coding sessions of users on open source projects.

The benchmark setup is temporal:

take prior coding-agent sessions from a repo

build memory only from those prior sessions

hold out a later session from the same repo

run the same planning task at the same pre-task commit

compare baseline vs memory-assisted agent

The held-out session is not used while building memory.
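The temporal setup above can be sketched roughly as follows. This is a minimal illustration, not Greplica's actual harness: the `Session` type, its fields, and the choice to hold out the latest session (the post only says "a later session") are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Session:
    # Hypothetical record for one coding-agent session.
    session_id: str
    repo: str
    started_at: datetime


def temporal_split(sessions, repo):
    """Return (prior_sessions, held_out) for one repo, ordered by time.

    Assumption: the latest session is the held-out task, and every
    earlier session from the same repo may be used to build memory.
    """
    ordered = sorted(
        (s for s in sessions if s.repo == repo),
        key=lambda s: s.started_at,
    )
    if len(ordered) < 2:
        raise ValueError("need at least one prior session and one held-out session")
    return ordered[:-1], ordered[-1]


sessions = [
    Session("s1", "example/repo", datetime(2025, 1, 5)),
    Session("s3", "example/repo", datetime(2025, 3, 1)),
    Session("s2", "example/repo", datetime(2025, 2, 10)),
    Session("x1", "other/repo", datetime(2025, 2, 1)),
]
prior, held_out = temporal_split(sessions, "example/repo")

# Leakage check: every memory source must predate the held-out session.
assert all(s.started_at < held_out.started_at for s in prior)
```

Splitting by time rather than at random is what keeps the held-out session out of the memory: nothing that happened at or after the task can end up in what the agent is given.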

The agent only gets access to repo memory created from earlier work: architectural facts, subsystem behavior, gotchas, failed attempts, implementation notes, constraints, etc. Each memory item is tied back to evidence from files/commits/sessions.

On the selected 10 high-context planning tasks, Greplica reduced:

cost by 43%

tokens by 49%

tool calls by 36%

elapsed planning time by 26%
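For reference, a reduction figure like the ones above is a relative difference against the baseline. A minimal sketch follows; the totals are made up for illustration and are not the benchmark's data, and the post does not say whether the reported numbers aggregate totals or average per-task reductions.

```python
def percent_reduction(baseline, with_memory):
    """Relative reduction of with_memory versus baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - with_memory) / baseline


# Hypothetical totals, for illustration only.
baseline_tokens = 200_000
memory_tokens = 150_000
print(f"{percent_reduction(baseline_tokens, memory_tokens):.0f}% fewer tokens")
```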

We tried to benchmark coding tasks as well, but that is difficult because coding trajectories vary a lot: one agent might run tests each time it codes, while another might not.

There were other interesting results as well. They are not polished yet, but we would love to share them.

Variance:

Running the same task multiple times without memory can produce very different planning traces.

Sometimes the agent finds the right subsystem quickly.

Sometimes it burns a lot of tokens exploring irrelevant files, gets anchored on the wrong abstraction, or only discovers the important context late in the run.

That makes single-run agent benchmarks pretty noisy.

Memory seems to reduce this variance because the early part of planning changes. The agent is no longer doing broad repo archaeology from zero. It starts with a smaller set of relevant claims, then uses repo exploration to verify and fill gaps.
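One way to put a number on this is to repeat each task several times and compare the spread of token usage, not just the mean. The sketch below uses the coefficient of variation (standard deviation divided by mean); the run counts are hypothetical, and the post does not say which variance measure was actually used.

```python
from statistics import mean, stdev


def run_spread(token_counts):
    """Return (mean, sample standard deviation, coefficient of variation)."""
    if len(token_counts) < 2:
        raise ValueError("need at least two runs to estimate spread")
    m = mean(token_counts)
    s = stdev(token_counts)
    return m, s, s / m


# Hypothetical token counts for repeated runs of one task.
baseline_runs = [100_000, 180_000, 60_000, 140_000]
memory_runs = [70_000, 80_000, 75_000, 75_000]

for label, runs in [("baseline", baseline_runs), ("memory", memory_runs)]:
    m, s, cv = run_spread(runs)
    print(f"{label}: mean={m:.0f} stdev={s:.0f} cv={cv:.2f}")
```

Reporting the spread next to the mean makes it visible when a single-run comparison could have gone either way.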

Greplica vs docs-folder

The second thing we are benchmarking now is Greplica vs a docs-folder baseline.

The obvious baseline is:

“Why not just write all prior session memory into markdown files and let the agent read them?”

When the docs folder is small, this actually works quite well.

Quality is similar. Token usage is also similar. There are only a few files, so the agent can cheaply scan them.

But as more sessions are ingested, the docs-folder baseline degrades badly. We saw this in cases where the number of ingested sessions grew from 3 to 11.

Greplica improves as sessions are added because there is more prior engineering context to retrieve from, and an optimized retrieval pipeline surfaces the relevant parts.

The docs folder gets worse on token usage because it slowly becomes another codebase. The agent now has to search the docs, rank relevance, detect stale notes, resolve conflicts, and decide which facts to consider.

So the bottleneck moves from storage to retrieval: as the notes grow, maintaining memory gradually turns into a retrieval problem.
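A toy sketch of why the costs diverge: scanning every note grows with the number of notes, while handing the agent only the top few matches stays bounded. The word-overlap ranking below is a deliberately crude stand-in chosen for the example, not Greplica's retrieval pipeline, and the notes and query are invented.

```python
def tokens(text):
    """Crude token count: whitespace-separated words (an approximation)."""
    return len(text.split())


def read_all_cost(notes):
    """Docs-folder style: the agent scans every note it is given."""
    return sum(tokens(n) for n in notes)


def top_k(notes, query, k):
    """Rank notes by word overlap with the query and keep the best k."""
    query_words = set(query.lower().split())
    ranked = sorted(
        notes,
        key=lambda n: len(query_words & set(n.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical memory notes and planning query.
notes = [
    "auth tokens are refreshed in the session middleware",
    "the billing job retries failed charges three times",
    "cache keys include the tenant id to avoid collisions",
    "session middleware must run before the rate limiter",
]
query = "where does the session middleware refresh auth tokens"

selected = top_k(notes, query, k=2)
print("read everything:", read_all_cost(notes), "tokens")
print("top-2 retrieval:", read_all_cost(selected), "tokens")
```

With a fixed `k`, the retrieval cost stays roughly flat as notes are added, whereas the read-everything cost keeps climbing. This sketch does not touch staleness or conflicting notes, which the docs-folder agent also has to sort out.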

Repo: https://github.com/Autoloops/greplica
