Author here. I benchmarked Mem0 and Zep on MemBench as memory layers for LLM agents, using gpt-5-nano on 4,000 conversational cases and comparing against a long-context baseline.
In this setup, the memory systems were 14–77× more expensive over a full conversation and 31–33% less accurate at recalling facts than just passing the full history. The post shows the results and argues that the shared “LLM-on-write” architecture (running background LLMs to extract/normalize facts on every message) is a bad fit for working memory / execution state, even though it’s useful for semantic long-term memory.
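For anyone unfamiliar with the pattern, here's a minimal sketch of the two write paths being compared. This is illustrative only, not Mem0's or Zep's actual API: `on_message_memory`, the extraction prompt, and the in-memory fact store are all placeholders. The point is just that the memory-layer path pays an extra LLM call on every message, while the baseline path is a free append.

```python
# Illustrative sketch of "LLM-on-write" vs. a long-context baseline.
# Not Mem0's or Zep's real API; names and prompts here are hypothetical.
from openai import OpenAI

client = OpenAI()

facts: list[str] = []       # placeholder fact store for the memory layer
history: list[dict] = []    # raw transcript for the long-context baseline


def on_message_memory(message: str) -> None:
    """LLM-on-write: every incoming message triggers a background
    LLM call to extract/normalize facts before they're stored."""
    resp = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "Extract durable facts, one per line."},
            {"role": "user", "content": message},
        ],
    )
    facts.extend(
        line for line in resp.choices[0].message.content.splitlines() if line.strip()
    )


def on_message_baseline(message: str) -> None:
    """Long-context baseline: no write-time LLM calls, just append."""
    history.append({"role": "user", "content": message})
```

Per-message extraction is where the cost multiplier comes from: N messages means N extra LLM calls on the write path, before the agent has answered anything.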
Scope is intentionally narrow: one model, one benchmark (MemBench, 2025), and non-exhaustive configs. The harness (`agentbench`, https://github.com/fastpaca/agentbench) is linked if you want to reproduce or propose a better setup!