I'm an independent researcher, and I've just published a whitepaper on an LLM memory architecture I designed: DREAM (Dynamic Retention Episodic Architecture for Memory).
The core problem I'm tackling is the tension between persistent memory and cost in large-scale AI systems. Storing nothing forces users to re-explain context. Storing everything creates a privacy, latency, and cost nightmare.
DREAM is a plug-in architectural pattern that sits around the LLM (no model changes needed) and unifies existing tech (RAG, NoSQL).
The core innovation is the Adaptive Retention Mechanism (ARM).
Instead of a static 30-day TTL, ARM dynamically extends an episode's TTL based on user engagement. For example, an episode's TTL doubles each time the user revisits it (e.g., 7 days -> 14 -> 28 -> 56).
This creates a "self-pruning" memory layer where storage cost scales directly with actual user relevance, not just raw traffic.
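To make ARM concrete, here's a minimal Python sketch of the doubling rule. The class shape, method names, and the 365-day cap are illustrative assumptions on my part; in a real deployment the TTL would be enforced at the storage layer (e.g., Cassandra's per-row TTL) rather than in application memory.

```python
import time

BASE_TTL_DAYS = 7    # initial retention window
MAX_TTL_DAYS = 365   # illustrative cap so TTLs can't grow unbounded

class Episode:
    """One stored memory episode with an adaptive TTL."""

    def __init__(self, episode_id: str):
        self.episode_id = episode_id
        self.ttl_days = BASE_TTL_DAYS
        self._refresh_expiry()

    def on_revisit(self) -> None:
        # ARM rule: each revisit doubles the TTL (7 -> 14 -> 28 -> 56 ...)
        self.ttl_days = min(self.ttl_days * 2, MAX_TTL_DAYS)
        self._refresh_expiry()

    def is_expired(self) -> bool:
        return time.time() >= self.expires_at

    def _refresh_expiry(self) -> None:
        self.expires_at = time.time() + self.ttl_days * 86400
```

Episodes that are never revisited simply age out at the base TTL, which is what produces the self-pruning behavior.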
The architecture also includes:
Episodic Units (EUs): Storing compressed summaries + embeddings, not raw logs (see the sketch after this list).
User-Centric Opt-In: Explicit user approval per episode for privacy.
Aligned Sharding: A design for sharding orchestrators and storage (partitioned by user_id) to ensure horizontal scalability and cache locality (see the sketch after this list).
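For reference, here's the rough shape of an Episodic Unit as a plain data record. The field names are illustrative, not a fixed schema from the paper:

```python
from dataclasses import dataclass

@dataclass
class EpisodicUnit:
    user_id: str            # shard/partition key (see Aligned Sharding)
    episode_id: str
    summary: str            # compressed summary, never the raw conversation log
    embedding: list[float]  # dense vector; indexed in FAISS for retrieval
    ttl_days: int = 7       # managed by ARM (doubles on revisit)
    opted_in: bool = False  # explicit per-episode user approval
```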
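And a sketch of the aligned shard routing. The hash-mod scheme below is just one concrete choice (Cassandra's own partitioner could play the same role); the key point is that the orchestrator and the storage partition derive the shard from the same user_id, so a user's episodes stay co-located:

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def shard_for(user_id: str) -> int:
    """Map a user_id to a shard; used by both the orchestrator and storage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```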
I designed DREAM to be a practical blueprint implementable with today's infrastructure (Cassandra, FAISS, Kubernetes, etc.).
I don't have the resources to test this at scale, so I'm publishing the architecture to share the idea. I would be genuinely grateful for any technical feedback, criticisms, or thoughts on the design.
Whitepaper (PDF): https://zenodo.org/records/17619917
GitHub (Code Examples/Arch): https://github.com/MatheusPereiraSilva/dream-architecture