We’ve been building memU (https://github.com/NevaMind-AI/memU), an open-source, general-purpose memory framework for AI agents. It supports dual-mode retrieval: classic RAG and LLM-based direct reading of memory files.
Most multimodal memory systems either embed everything into vectors or treat non-text data as attachments. Both approaches work, but at scale it becomes hard to explain why a given piece of context was retrieved and what evidence it rests on.
memU takes a different approach: since models reason in language, multimodal memory should converge into structured, queryable text, while remaining fully traceable to original data.
---
## Three-Layer Architecture
- Resource Layer: stores raw multimodal data as ground truth. All higher-level memory remains traceable to this layer.
- Memory Item Layer: extracts atomic facts from raw data and stores them as natural-language statements. Embeddings are optional and used only for acceleration.
- Memory Category Layer: aggregates items into readable, theme-based memory files (e.g. user preferences, work logs). Frequently accessed topics stay active; low-usage content is demoted to balance speed and coverage.
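To make the layering concrete, here is a minimal sketch of the three layers as plain data structures. The class and field names (`Resource`, `MemoryItem`, `MemoryCategory`, `source_id`, and so on) are assumptions for illustration, not memU's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Resource:
    """Raw multimodal data (text, image, audio, ...) kept as ground truth."""
    id: str
    modality: str          # e.g. "text", "image", "audio"
    uri: str               # where the original bytes live

@dataclass
class MemoryItem:
    """An atomic, natural-language fact extracted from a resource."""
    id: str
    statement: str         # e.g. "The user prefers dark-roast coffee."
    source_id: str         # back-pointer to the Resource it came from
    embedding: Optional[list[float]] = None  # optional, used only for acceleration

@dataclass
class MemoryCategory:
    """A readable, theme-based memory file aggregating related items."""
    name: str              # e.g. "user_preferences", "work_log"
    item_ids: list[str] = field(default_factory=list)
    access_count: int = 0  # drives promotion/demotion over time
```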
---
## Memorization
Bottom-up and asynchronous. Data flows from resources → items → category files without manual schemas. When capacity is reached, recently relevant memories replace the least used ones.
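A rough sketch of that bottom-up flow, assuming hypothetical `extract_facts` and `category_for` helpers (memU's own extraction logic and capacity policy will differ):

```python
def memorize(resource_text: str, resource_id: str,
             categories: dict[str, list[dict]],
             extract_facts,                 # callable: raw text -> list of fact strings
             category_for,                  # callable: fact -> category name
             max_items_per_category: int = 200) -> None:
    """Bottom-up memorization: resource -> items -> category files."""
    for fact in extract_facts(resource_text):
        name = category_for(fact)
        items = categories.setdefault(name, [])
        items.append({"statement": fact, "source_id": resource_id, "uses": 0})
        # When a category is over budget, drop the least-used item first.
        if len(items) > max_items_per_category:
            items.sort(key=lambda it: it["uses"])
            items.pop(0)   # demoted; still recoverable from the raw resource
```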
## Retrieval
Top-down. memU searches category files first, then items, and only falls back to raw data if needed. At the item layer, it combines BM25 and embeddings to balance exact matching with semantic recall, avoiding the imprecision of embedding-only search.
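A hybrid scorer of this kind can be sketched as a weighted sum of a normalized BM25 score and embedding cosine similarity. The example assumes the third-party `rank_bm25` and `sentence-transformers` packages and an `all-MiniLM-L6-v2` model; memU's own weighting and models may differ.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def hybrid_search(query: str, items: list[str], alpha: float = 0.5, top_k: int = 5):
    """Score memory items with BM25 (exact matching) + embeddings (semantic recall)."""
    # Lexical side: BM25 over whitespace-tokenized items, normalized to [0, 1].
    bm25 = BM25Okapi([it.lower().split() for it in items])
    lex = np.array(bm25.get_scores(query.lower().split()))
    lex = lex / (lex.max() + 1e-9)

    # Semantic side: cosine similarity of unit-normalized sentence embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(items + [query], normalize_embeddings=True)
    sem = emb[:-1] @ emb[-1]

    # Weighted combination; alpha trades exact matching against semantic recall.
    scores = alpha * lex + (1 - alpha) * sem
    ranked = np.argsort(-scores)[:top_k]
    return [(items[i], float(scores[i])) for i in ranked]
```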
Dual-mode retrieval lets applications choose between:
- low-latency embedding search, or
- LLM-based direct reading of memory files.
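That choice can be expressed as a simple dispatch. The `vector_search` and `llm_complete` callables below are hypothetical stand-ins for whatever retrieval backend and LLM client an application uses:

```python
def retrieve(query: str, mode: str,
             vector_search,    # callable: query -> list of matching item statements
             llm_complete,     # callable: prompt string -> model answer
             category_file_text: str) -> str:
    """Dual-mode retrieval: 'embedding' for low latency, 'llm' for direct file reading."""
    if mode == "embedding":
        # Fast path: nearest-neighbour lookup over item embeddings.
        return "\n".join(vector_search(query))
    # Slow path: let the model read the human-readable memory file itself.
    prompt = (
        "Answer the question using only the memory file below.\n\n"
        f"MEMORY FILE:\n{category_file_text}\n\nQUESTION: {query}"
    )
    return llm_complete(prompt)
```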
## Evolution
Memory structure adapts automatically based on real usage:
- Frequently accessed memories remain at the Category layer
- Memories retrieved from raw data are promoted upward and linked
- Organization evolves from usage patterns, not predefined rules
Goal: keep relevant memories retrievable at the Category layer and minimize latency over time.
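One way to sketch that usage-driven promotion and demotion is with plain access counters. The thresholds and the `demoted` pool below are illustrative assumptions, not memU's actual policy:

```python
def evolve(categories: dict[str, list[dict]],
           demoted: list[dict],
           promote_after: int = 3,
           demote_below: int = 1) -> None:
    """Adapt memory placement from observed usage, not predefined rules."""
    # Promote: items retrieved often enough from the demoted pool / raw data
    # move back up to their category file.
    for item in list(demoted):
        if item["uses"] >= promote_after:
            categories.setdefault(item["category"], []).append(item)
            demoted.remove(item)

    # Demote: category-layer items that are almost never accessed move down,
    # keeping the Category layer small and fast to scan.
    for name, items in categories.items():
        for item in list(items):
            if item["uses"] < demote_below:
                items.remove(item)
                item["category"] = name
                demoted.append(item)
```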
---
## A Unified Multimodal Memory Pipeline
memU is a text-centered multimodal memory system. Multimodal inputs are progressively converted into interpretable text memory while staying traceable to the original data. This provides stable, high-level context for reasoning, with detailed evidence available when needed, inside a memory structure that evolves through real-world use.
---
Junnn • 1d ago
Most agent memory stacks today collapse everything into embeddings and hope similarity search is enough. That works for recall, but breaks down quickly when you need traceability, temporal reasoning, or explanation of why something was remembered.
The layered design here (raw resources → extracted memory items → categorized memory files) feels much closer to how we design real systems: separation of concerns, clear abstraction boundaries, and the ability to reason about state changes over time.
Storing memories in human-readable form also makes debugging and evolution practical. You can audit what the agent “knows”, adjust policies, or let the LLM reason directly over memory instead of treating it as a black box vector store.
Embeddings still make sense as an optimization layer, but making them optional rather than foundational is an important architectural choice if agents are meant to run long-term and stay coherent.
This feels less like a retrieval hack and more like actual infrastructure.