alex_petrov•1h ago
These are all SOTA scores on agentic memory benchmarks. None of them tell you whether the system will work in production.
The deeper problem isn't the data; it's that we often misunderstand what these numbers actually measure. In our recent white paper, we open-sourced datasets that target specific memory functions. Today we published a follow-up explaining why we think the well-known agentic memory benchmarks (LoCoMo, LongMemEval) miss the mark for production systems, and what we measure instead.
We're in a field that is measuring itself against itself. The real question isn't 'are we beating last week's leaderboard?' — it's 'are we building something that makes people's work meaningfully better?' That's harder to measure. It's also the only thing that matters.