Coding Agent Memory Benchmarks

1•kushalpatil07•1h ago
Something I’m finding while testing SWE-context-bench for the agent memory layer I’m building: evaluating memory is harder than checking whether the agent solved the next task with fewer tokens.

The setup: An agent solves a coding task. Later, it gets a related task that should benefit from the earlier session. That is the right shape for testing memory. But the details get messy.

Tool use: Sometimes the agent can just web search, inspect the repo, or rediscover the answer.

The task passes, but did memory help? You have to inspect the logs and ask where the answer came from: memory, current codebase, web search, or the model figuring it out again.

So the benchmark is not just measuring success. It is also measuring provenance.
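One way to make the provenance question concrete is to tag each logged tool call with where it read from, then find the first call that surfaced the answer. This is only a minimal sketch: the `ToolCall` record, the source labels, and the idea of matching on "answer markers" (strings that only appear in the correct fix) are all assumptions for illustration, not part of SWE-context-bench.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ToolCall:
    # Hypothetical log record: where the call read from and what it returned.
    source: str  # e.g. "memory", "repo", "web"
    output: str


def attribute_provenance(calls: Iterable[ToolCall], answer_markers: List[str]) -> str:
    """Return the source of the first call whose output contains any marker.

    Falls back to "model" when no logged call surfaced the answer, meaning
    the model most likely worked it out again on its own.
    """
    for call in calls:
        if any(marker in call.output for marker in answer_markers):
            return call.source
    return "model"


def memory_attributed_passes(runs) -> int:
    """Count passing runs where memory was the first source of the answer.

    `runs` is an iterable of (passed, calls, answer_markers) tuples.
    """
    return sum(
        1
        for passed, calls, markers in runs
        if passed and attribute_provenance(calls, markers) == "memory"
    )
```

String matching like this is crude, so in practice you would still spot check the logs by hand. The point is only that pass rate and memory-attributed pass rate are different numbers.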

Timeline issues: The benchmark has an original task and a related task.

The related task is supposed to use context from the original task. But sometimes the ordering is reversed: the “original” task is effectively from the future, and the “related” task is from the past.

So the repo can already contain the answer that memory was supposed to provide. This is a dataset issue, and it completely changes what the score means.
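A cheap sanity check for the ordering problem is to compare timestamps for each task pair before scoring anything. The sketch below assumes each pair comes with a timestamp for the original task and one for the related task (for example, commit or issue dates); those fields and the pair layout are hypothetical and would need mapping onto whatever the dataset actually provides.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple

# Hypothetical pair layout: (pair_id, original_task_time, related_task_time)
Pair = Tuple[str, datetime, datetime]


def is_misordered(original_time: datetime, related_time: datetime) -> bool:
    """True when the 'original' task does not come strictly before the 'related' one."""
    return original_time >= related_time


def split_pairs(pairs: Iterable[Pair]) -> Tuple[List[str], List[str]]:
    """Split pair ids into (usable, flagged) based on chronological order."""
    usable: List[str] = []
    flagged: List[str] = []
    for pair_id, original_time, related_time in pairs:
        if is_misordered(original_time, related_time):
            flagged.append(pair_id)
        else:
            usable.append(pair_id)
    return usable, flagged


if __name__ == "__main__":
    utc = timezone.utc
    example = [
        ("pair-a", datetime(2024, 1, 10, tzinfo=utc), datetime(2024, 3, 2, tzinfo=utc)),
        ("pair-b", datetime(2024, 6, 1, tzinfo=utc), datetime(2024, 2, 15, tzinfo=utc)),
    ]
    usable, flagged = split_pairs(example)
    print("usable:", usable)
    print("flagged:", flagged)
```

Flagged pairs are not necessarily useless, but they should be scored separately or inspected by hand, since a pass on them may not say anything about memory.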

Benchmark gaming: There is also an easy but bad strategy: after every task, write a very detailed summary of everything.

This works on the benchmark precisely because you know the next task will be related.

Now, let's say you solve all of the above problems. Would that mean your system is good?

Creating a benchmark that actually mimics product performance looks like most of the battle here.

Would love to hear of a good way to benchmark this.

Manifesto for Agentic Teams – reorganizing engineering around AI agents

https://agentic-team-manifesto.org/
1•growt•1m ago•0 comments

WebAssembly Language Tools v0.11.0 is released

https://github.com/g-plane/wasm-language-tools/releases/tag/v0.11.0
1•gplane•3m ago•0 comments

High severity Chrome CVE-2026-11645

2•sensanaty•4m ago•1 comments

The backup SSH daemon I run before every do-release-upgrade

https://ma.ttias.be/backup-sshd-do-release-upgrade/
2•speckx•5m ago•0 comments

240-MP is a retro VCR style front end for content on Raspberry Pi (on a CRT TV)

https://github.com/anthonycaccese/240-MP/tree/main
1•zdw•5m ago•0 comments

SpaceX: The First $100T Company?

https://twitter.com/valmiremini/status/2064715221042704478
1•skenderbeg•6m ago•0 comments

Digesting a codebase before a model reads it

https://matthew-johnston.com/digesting-a-codebase-before-a-model-reads-it/
1•mattjstn•6m ago•0 comments

Everyone Is Buying Tokens. Almost Nobody Is Shipping

https://abhisheksoniai.substack.com/p/everyone-is-buying-tokens-almost
2•MaxMussio•7m ago•0 comments

Cops Keep Getting Arrested for Using Flock to Stalk People

https://www.404media.co/cops-keep-getting-arrested-for-using-flock-to-stalk-people/
1•Brajeshwar•7m ago•0 comments

Britain Became as Poor as Mississippi

https://www.theatlantic.com/magazine/2026/07/uk-productivity-economy-reform-party/687303/
14•SanjayMehta•10m ago•2 comments

We Should Take Text Optimization More Seriously

https://yoonholee.com/blog/2026/we-should-take-text-optimization-more-seriously/
1•gmays•11m ago•0 comments

Finops-scan: Free CLI to scan AWS Cost Explorer for waste (open source, Python)

https://github.com/kamsteph/finops-scan
2•kamsteph•12m ago•0 comments

Ronny Chieng Told Harvard Grads to 'Destroy AI.' They Cheered

https://www.inc.com/jessica-stillman/ronny-chieng-told-harvard-grads-to-destroy-ai-they-cheered/9...
3•1vuio0pswjnm7•14m ago•2 comments

Faster inference won't save you

https://graphcoder.ai/blog/faster-inference-wont-save-you
2•ramstar3000•14m ago•1 comments

The Wrong Epsilon to the Brain

https://hari.computer/the-ledger-mistakes-the-brain
1•andytratt•15m ago•0 comments

Tsunahiro

https://tsunagarujp.mext.go.jp/?lang_id=EN
2•skogstokig•16m ago•0 comments

Oops: A short story about time

https://gabor.monomo.io/oops
3•vajdagabor•17m ago•0 comments

TheBrain on Linux

https://baty.net/posts/2026/06/thebrain-on-linux/
1•speckx•17m ago•0 comments

Show HN: Petiglyph – TUI/CLI to turn images and videos into custom font glyphs

https://github.com/petipoua/petiglyph
1•peti_poua•18m ago•0 comments

Ninety Percent of Job Platforms Sell User Data, Study Finds

https://www.inc.com/bruce-crumley/90-percent-of-job-platforms-sell-user-data-study-finds-here-are...
1•1vuio0pswjnm7•19m ago•1 comments

Narra – offline bilingual e-reader that translates books on-device

https://github.com/dhirajhimani/Narra-public
1•dhrjkmr538•19m ago•0 comments

Show HN: DESi Sees It

https://hstre.github.io/DESi/index.html
1•hstrex•20m ago•0 comments

Bumsrakete: FreeBSD 15 CopyFail Style LPE – Many say the best

https://bumsrake.de
1•whally•21m ago•0 comments

Show HN: A curated collection of simple datasets for machine learning

https://github.com/pplonski/datasets-for-start
3•pplonski86•22m ago•1 comments

I'm launching Tech Influence Watch as AI follows crypto into politics

https://www.citationneeded.news/tech-influence-watch/
1•speckx•23m ago•0 comments

Google Gemini in Workspaces is down

https://www.google.com/appsstatus/dashboard/incidents/CzZUn98mhTcEiCJo27Kv?hl=en
1•nallerooth•24m ago•0 comments

TorchCodec 0.14: HDR Video Decoding for CPU and CUDA, and Fast Wav Decoder

https://github.com/meta-pytorch/torchcodec/releases/tag/v0.14.0
1•scott_s•24m ago•1 comments

Sprite: From Static Mockups to Engine-Ready Game UI

https://arxiv.org/abs/2604.18591
1•PaulHoule•25m ago•0 comments

Explicit Seams as Agent Affordances

https://blog.tacoda.dev/explicit-seams-as-agent-affordances-5f5151dfebe6
2•tacoda•27m ago•0 comments

GnuCash is right. It's also why I built my own finance app

https://k-id.app/blog/gnucash-is-right/
6•tinosar•27m ago•0 comments