- Baseline RAG (embedding similarity only): 10%
- RAG + reranker: 20%
- Outcomes only (no reranker): 60%
- RAG + outcome scoring (mature memories with 20+ uses): 60%
"Accuracy" here means the correct memory is ranked #1 for the query. The outcome scoring uses the Wilson score lower bound: memories that consistently get positive feedback from the "user" get boosted, and ones that fail get demoted.
Test methodology: https://github.com/roampal-ai/roampal/blob/main/dev/benchmar...
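For anyone unfamiliar with the Wilson score lower bound mentioned above, here is a rough sketch of the idea in Python. This is not the roampal implementation; the function name, the 95% z-value, and the sample memories are assumptions made up for illustration. The point it shows is that a memory with few uses gets a cautious score even if all its feedback was positive, while a memory with many mostly-positive uses scores higher.

```python
import math

def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a success proportion.

    z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if total == 0:
        return 0.0  # no feedback yet, so no evidence the memory is useful
    p = positive / total
    z2 = z * z
    denom = 1 + z2 / total
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)
    return (centre - margin) / denom

# Hypothetical memories: (name, positive outcomes, total uses)
memories = [
    ("new_but_lucky", 2, 2),       # 100% positive, but only 2 uses
    ("mature_reliable", 18, 20),   # 90% positive over 20 uses
    ("mature_unreliable", 5, 20),  # 25% positive over 20 uses
]

# Rank by the lower bound instead of the raw positive rate.
ranked = sorted(memories, key=lambda m: wilson_lower_bound(m[1], m[2]), reverse=True)
for name, pos, total in ranked:
    print(f"{name}: {wilson_lower_bound(pos, total):.3f}")
```

Ranking by raw positive rate would put the 2-for-2 memory on top; the lower bound puts the 18-for-20 memory first because there is more evidence behind it. How this score gets combined with embedding similarity is not covered here.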
I think this is also a kind of tagging.
mistrial9•1h ago
roampal•1h ago
udfalkso•48m ago