keep (https://github.com/hughpyle/keep/, MIT-licensed) is a skills practice wrapped around an implementation of "memory for AI agents".
The practice is this: repeated reflection on means and outcomes, so that skillful action improves over time. But the raw implementation of memory is its foundation. Without working memory, you can't iterate.
Similarly, without benchmarks, you can't tell what works. Today we're publishing results for the LoCoMo benchmark.
Scores: 76.2% overall (weighted average)
- Single-hop: 86.2% (841 questions)
- Temporal: 68.5% (321 questions)
- Multi-hop: 64.2% (282 questions)
- Open-domain: 50.0% (96 questions)
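As a sanity check, the overall number is consistent with a question-count-weighted mean of the four category scores:

```python
# Per-category LoCoMo scores from the run above: (percent, question count)
scores = {
    "single-hop":  (86.2, 841),
    "temporal":    (68.5, 321),
    "multi-hop":   (64.2, 282),
    "open-domain": (50.0,  96),
}

total_q = sum(n for _, n in scores.values())
overall = sum(pct * n for pct, n in scores.values()) / total_q
print(f"{overall:.1f}% over {total_q} questions")  # -> 76.2% over 1540 questions
```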
The linked blog post has more detail, including industry comparisons, plus links to full repro steps and the result data.
This run used local models for embeddings and analysis (nomic-embed-text and llama3.2:3b), and gpt-4o-mini for the query and judge.
I think this is a proof point that a *local-only* LLM-assisted memory system can achieve solid benchmark scores.
Background:
`keep` started with my experience of using forgetful agents, and identifying the need for a skill that implements "reflective" memory (not itself new; see e.g. Shinn et al., https://arxiv.org/abs/2303.11366). Here the reflection practice is quite opinionated, saying effectively: what you do is what you become.
Whether *this* works is not a subject of the benchmark.
Docs: copious documentation at https://docs.keepnotes.ai/guides/