
Something I’m finding while testing SWE-context-bench for the agent memory layer I’m building: evaluating memory is harder than checking whether the agent solved the next task with fewer tokens.

The setup: An agent solves a coding task. Later, it gets a related task that should benefit from the earlier session. That is the right shape for testing memory. But the details get messy.

Tool use: Sometimes the agent can just web search, inspect the repo, or rediscover the answer.

The task passes, but did memory help? You have to inspect the logs and ask where the answer came from: memory, current codebase, web search, or the model figuring it out again.

So the benchmark is not just measuring success. It is also measuring provenance.
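A minimal sketch of that log inspection, assuming the harness records each tool result with a source label. The `Event` schema and the `key_fact` string match are hypothetical, not something SWE-context-bench provides; a real check would probably need fuzzier matching than a substring test.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    # Hypothetical log record: which channel returned the text.
    source: str  # "memory", "repo", or "web"
    text: str    # what that tool call returned


def attribute_answer(events: List[Event], key_fact: str) -> str:
    """Return the first source whose output contained key_fact.

    If no tool output ever contained it, assume the model worked
    it out again and return "model".
    """
    for ev in events:
        if key_fact in ev.text:
            return ev.source
    return "model"


log = [
    Event("repo", "def fetch(url): ..."),
    Event("memory", "last session: wrap fetch in retry_with_backoff"),
    Event("web", "retry_with_backoff usage examples"),
]
print(attribute_answer(log, "retry_with_backoff"))
```

Only runs attributed to "memory" would count as evidence that memory helped; a pass attributed to "repo" or "web" says nothing about the memory layer.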

Timeline issues: The benchmark has an original task and a related task.

The related task is supposed to use context from the original task. But sometimes the ordering is weird. The “original” task is effectively from the future, and the “related” task is from the past.

So the repo can already contain the answer that memory was supposed to provide. This is a dataset issue, but it completely changes what the score means.
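One way to catch these pairs up front is to compare timestamps before scoring. This is a sketch under assumptions: the `TaskPair` fields are hypothetical names, and it presumes each task can be tied to a commit or issue date.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class TaskPair:
    # Hypothetical fields; the real dataset schema may differ.
    pair_id: str
    original_time: datetime  # when the "original" task's snapshot was taken
    related_time: datetime   # when the "related" task's snapshot was taken


def inverted_pairs(pairs: List[TaskPair]) -> List[str]:
    """Return ids of pairs where the original task is not strictly earlier."""
    return [p.pair_id for p in pairs if p.original_time >= p.related_time]


pairs = [
    TaskPair("a", datetime(2023, 1, 10), datetime(2023, 3, 1)),  # expected order
    TaskPair("b", datetime(2023, 6, 5), datetime(2023, 2, 1)),   # inverted
]
print(inverted_pairs(pairs))
```

Flagged pairs can be dropped or reported separately, so the headline score only covers pairs where the earlier session really came first.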

Benchmark gaming: There is also an easy bad strategy: after every task, write a very detailed summary of everything.

If you know the next task will be related, this works.
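One possible way to make this strategy visible, sketched under assumptions: mix in follow-ups that are not related, and report what was written to memory next to the pass rate. The `Run` fields and the idea of unrelated follow-ups are hypothetical additions, not part of the benchmark.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    # Hypothetical per-task record.
    passed: bool                # did the follow-up task pass
    memory_tokens_written: int  # tokens stored after the first task
    followup_related: bool      # was the follow-up actually related


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Report pass rate on related follow-ups alongside memory write cost."""
    related = [r for r in runs if r.followup_related]
    pass_rate = sum(r.passed for r in related) / len(related) if related else 0.0
    avg_written = (
        sum(r.memory_tokens_written for r in runs) / len(runs) if runs else 0.0
    )
    # Tokens stored for sessions whose follow-up never needed them.
    wasted = sum(r.memory_tokens_written for r in runs if not r.followup_related)
    return {
        "related_pass_rate": pass_rate,
        "avg_tokens_written": avg_written,
        "tokens_written_for_unrelated": float(wasted),
    }


runs = [
    Run(True, 4000, True),
    Run(True, 4000, False),
    Run(False, 200, True),
    Run(True, 200, False),
]
print(summarize(runs))
```

A system that summarizes everything can still score well on the pass rate, but the write-cost columns show what it paid for it.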

Now, let's say you solve all of the above problems. Does a good score then mean your system is good?

Creating a benchmark that actually mimics product performance looks like most of the battle here.

Would love to hear about a good way to benchmark this.

Coding Agent Memory Benchmarks