At mastra.ai we ran the LongMemEval benchmark (500 questions across thousands of conversations) to systematically test our agent memory features. After seeing claims that "RAG is dead for agent memory," we wanted to see how far a RAG-based memory system could actually go.
Starting at a low 65% accuracy, we made some changes to how our memory system works and reached 80% using RAG alone. Since we're a configurable framework, we ran the benchmark across a range of configs and saw results from 63% with very conservative settings, to 74% with small-to-medium context sizes, up to 80% with longer context. We accidentally spent $8k and burned 3.8B tokens figuring this out, but it proved that RAG absolutely works for agent memory when properly configured.
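
To give a sense of what "conservative" versus "longer context" means in practice, here is a minimal sketch of the kind of memory config involved. It assumes Mastra's Memory options expose lastMessages and semanticRecall with topK/messageRange parameters; exact option names, defaults, and any required storage or vector backends may differ by version, and the numbers below are illustrative rather than the settings we benchmarked.

    import { Agent } from "@mastra/core/agent";
    import { Memory } from "@mastra/memory";
    import { openai } from "@ai-sdk/openai";

    // Illustrative memory config: recalling more messages and including
    // more surrounding context trades accuracy against token cost.
    const memory = new Memory({
      options: {
        lastMessages: 20,        // recent conversation history kept verbatim
        semanticRecall: {
          topK: 5,               // how many similar past messages to retrieve (RAG)
          messageRange: 3,       // neighboring messages included around each hit
        },
      },
    });

    const agent = new Agent({
      name: "memory-agent",
      instructions: "Answer using the user's conversation history when relevant.",
      model: openai("gpt-4o"),
      memory,
    });

Roughly, dialing lastMessages, topK, and messageRange down gives the conservative, cheap end of the range, while dialing them up gives the longer-context, higher-accuracy (and much more expensive) end.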